1 Introduction

Today we will begin our exploration of the {dplyr} package! The first verb on our list is select(), which lets you keep or drop variables from your data frame. Choosing your variables is the first step in cleaning your data.

Fig: the select() function.

Let’s go!

2 Learning objectives

  • You can keep or drop columns from a data frame using the select() function from the {dplyr} package.

  • You can select a range or combination of columns using operators like the colon (:), the exclamation mark (!), and the c() function.

  • You can select columns based on patterns in their names with helper functions like starts_with(), ends_with(), contains(), and everything().

  • You can use rename() and select() to change column names.

3 The Yaounde COVID-19 dataset

In this lesson, we analyse results from a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people in the region had been infected with COVID-19 by testing for IgG and IgM antibodies. The full dataset is available on Zenodo, and the survey is described in an accompanying paper.

Spend some time browsing through this dataset. Each line corresponds to one patient surveyed. There are some demographic, socio-economic and COVID-related variables. The results of the IgG and IgM antibody tests are in the columns igg_result and igm_result.

library(readr) # provides read_csv()
library(dplyr) # provides select(), rename() and the %>% pipe
yaounde <- read_csv(here::here("data/yaounde_data.csv"))
yaounde

Left: the Yaounde survey team. Right: an antibody test being administered.

4 Introducing select()

Fig: the select() function. (Drawing adapted from Allison Horst).

dplyr::select() lets us pick which columns (variables) to keep or drop.

We can select a column by name:

yaounde %>% select(age) 

Or we can select a column by position:

yaounde %>% select(3) # `age` is the 3rd column
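To check that selecting by name and by position gives the same result, here is a minimal sketch on a small made-up tibble. The `toy` data frame and its values are hypothetical stand-ins, not the real yaounde data.

```r
library(dplyr)

# Hypothetical toy data (not the real yaounde dataset)
toy <- tibble(
  id = 1:3,
  sex = c("Female", "Male", "Male"),
  age = c(45, 55, 23)
)

by_name <- toy %>% select(age)     # select `age` by name
by_position <- toy %>% select(3)   # `age` is the 3rd column of `toy`

# Both results contain the single column `age`
names(by_name)
names(by_position)
```

Selecting by name is usually the safer choice: a position silently points to a different column if the column order ever changes.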

To select multiple variables, we separate them with commas:

yaounde %>% select(age, sex, igg_result)
## # A tibble: 971 × 3
##      age sex    igg_result
##    <dbl> <chr>  <chr>     
##  1    45 Female Negative  
##  2    55 Male   Positive  
##  3    23 Male   Negative  
##  4    20 Female Positive  
##  5    55 Female Positive  
##  6    17 Female Negative  
##  7    13 Female Positive  
##  8    28 Male   Negative  
##  9    30 Male   Negative  
## 10    13 Female Positive  
## # … with 961 more rows

  • Select the weight and height variables in the yaounde data frame.

  • Select the 16th and 22nd columns in the yaounde data frame.


For the next part of the tutorial, let’s create a smaller subset of the data, called yao.

yao <-
  yaounde %>% select(age,
                     sex,
                     highest_education,
                     occupation,
                     is_smoker,
                     is_pregnant,
                     igg_result,
                     igm_result)
yao

4.1 Selecting column ranges with :

The : operator selects a range of consecutive variables:

yao %>% select(age:occupation) # Select all columns from `age` to `occupation`

We can also specify a range with column numbers:

yao %>% select(1:4) # Select columns 1 to 4
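The two ways of writing a range can be compared on a small example. This is only a sketch: the `toy` tibble below is made up for illustration and simply mimics the order of the first columns of yao.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  highest_education = c("Secondary", "University"),
  occupation = c("Trader", "Student"),
  is_smoker = c("Non-smoker", "Smoker")
)

range_by_name <- toy %>% select(age:occupation)  # from `age` to `occupation`
range_by_position <- toy %>% select(1:4)         # columns 1 to 4

# Both selections keep the same four columns
names(range_by_name)
identical(names(range_by_name), names(range_by_position))
```

The range is inclusive on both ends, and it follows the order of the columns in the data frame, not alphabetical order.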

  • With the yaounde data frame, select the columns between symptoms and sequelae, inclusive. (“Inclusive” means you should also include symptoms and sequelae in the selection.)

4.2 Excluding columns with !

The exclamation point negates a selection:

yao %>% select(!age) # Select all columns except `age`

To drop a range of consecutive columns, negate the range, for example !age:occupation:

yao %>% select(!age:occupation) # Drop columns from `age` to `occupation`

To drop several non-consecutive columns, place them inside !c():

yao %>% select(!c(age, sex, igg_result))
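The three forms of negation above can be tried side by side. The sketch below uses a made-up `toy` tibble (hypothetical data, not the survey itself) so that the dropped columns are easy to see.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  occupation = c("Trader", "Student"),
  igg_result = c("Negative", "Positive")
)

drop_one <- toy %>% select(!age)                     # everything except `age`
drop_range <- toy %>% select(!age:sex)               # R reads this as !(age:sex)
drop_several <- toy %>% select(!c(age, igg_result))  # drop two non-consecutive columns

names(drop_one)
names(drop_range)
names(drop_several)
```

In R, the : operator is evaluated before !, which is why !age:sex drops the whole range without needing extra parentheses.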

  • From the yaounde data frame, remove all columns between highest_education and consultation, inclusive.

5 Helper functions for select()

{dplyr} provides a number of helper functions that make selecting easier by matching patterns in the column names. Let’s take a look at some of these.

5.1 starts_with() and ends_with()

These two helpers work exactly as their names suggest!

yao %>% select(starts_with("is_")) # Columns that start with "is_"
yao %>% select(ends_with("_result")) # Columns that end with "_result"
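Here is a small sketch of both helpers on made-up data. The `toy` tibble is hypothetical; the `isolated` column is invented to show that the full prefix "is_" must match, not just "is".

```r
library(dplyr)

# Hypothetical toy data (values and the `isolated` column are invented)
toy <- tibble(
  age = c(45, 55),
  is_smoker = c("Non-smoker", "Smoker"),
  is_pregnant = c("No", "No"),
  isolated = c("Yes", "No"),
  igg_result = c("Negative", "Positive"),
  igm_result = c("Negative", "Negative")
)

starts <- toy %>% select(starts_with("is_"))    # `isolated` is not matched
ends <- toy %>% select(ends_with("_result"))

names(starts)
names(ends)
```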

5.2 contains()

contains() helps select columns that contain a certain string:

yaounde %>% select(contains("drug")) # Columns that contain the string "drug"
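As a quick sketch of how contains() behaves, the example below uses a made-up `toy` tibble. The column names are hypothetical and chosen only to show matches at different places in a name; they are not taken from the yaounde data.

```r
library(dplyr)

# Hypothetical toy data (column names invented for illustration)
toy <- tibble(
  age = c(45, 55),
  drug_parac = c("Yes", "No"),
  took_drugs = c("Yes", "No"),
  Drug_source = c("Pharmacy", NA)
)

# By default, contains() ignores case, so `Drug_source` is matched too
any_case <- toy %>% select(contains("drug"))

# Setting ignore.case = FALSE makes the match case-sensitive
exact_case <- toy %>% select(contains("drug", ignore.case = FALSE))

names(any_case)
names(exact_case)
```

The string can appear anywhere in the column name: at the start, in the middle, or at the end.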

5.3 everything()

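Another helper function, everything(), matches all variables. Inside select(), columns that you have already listed keep their position, so everything() effectively adds all the columns that have not yet been selected.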

# First, `is_pregnant`, then every other column.
yao %>% select(is_pregnant, everything())

It is often useful for establishing the order of columns.

Say we wanted to bring the is_pregnant column to the start of the yao data frame. We could type out all the column names manually:

yao %>% select(is_pregnant, 
               age, 
               sex, 
               highest_education, 
               occupation, 
               is_smoker, 
               igg_result, 
               igm_result)

But this would be painful for larger data frames, such as our original yaounde data frame. In such a case, we can use everything():

# Bring `is_pregnant` to the front of the data frame
yaounde %>% select(is_pregnant, everything())
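To see the reordering on a small scale, here is a minimal sketch with a made-up `toy` tibble (hypothetical data, not the real survey).

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  is_pregnant = c("No", "No"),
  igg_result = c("Negative", "Positive")
)

reordered <- toy %>% select(is_pregnant, everything())

names(toy)        # original column order
names(reordered)  # `is_pregnant` now comes first
ncol(reordered) == ncol(toy)  # no column is lost or duplicated
```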

This helper can be combined with many others.

# Bring columns that end with "result" to the front of the data frame
yaounde %>% select(ends_with("result"), everything())

  • Select all columns in the yaounde data frame that start with “is_”.

  • Move the columns that start with “is_” to the beginning of the yaounde data frame.

6 Change column names with rename()

Fig: the rename() function. (Drawing adapted from Allison Horst)

dplyr::rename() is used to change column names:

# Rename `age` and `sex` to `patient_age` and `patient_sex`
yaounde %>% 
  rename(patient_age = age, 
         patient_sex = sex)

The fact that the new name comes first in the function (rename(NEWNAME = OLDNAME)) is sometimes confusing. You should get used to this with time.
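A small sketch may help the NEWNAME = OLDNAME order stick. The `toy` tibble below is made up for illustration and is not part of the yaounde data.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  igg_result = c("Negative", "Positive")
)

# New name on the left, old name on the right
renamed <- toy %>%
  rename(patient_age = age,
         patient_sex = sex)

names(renamed)  # renamed columns, with `igg_result` untouched
names(toy)      # `toy` itself is unchanged because we did not overwrite it
```

Note that rename() keeps every column and leaves the column order as it was; only the names change.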

6.1 Rename within select()

You can also rename columns while selecting them:

# Select `age` and `sex`, and rename them to `patient_age` and `patient_sex`
yaounde %>% 
  select(patient_age = age, 
         patient_sex = sex)
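The difference between renaming with rename() and renaming inside select() can be sketched on a made-up `toy` tibble (hypothetical data, used only for illustration).

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  igg_result = c("Negative", "Positive")
)

with_rename <- toy %>% rename(patient_age = age, patient_sex = sex)
with_select <- toy %>% select(patient_age = age, patient_sex = sex)

names(with_rename)  # all three columns are kept
names(with_select)  # only the two selected columns are kept
```

So select() is the better fit when you want to rename and drop columns in one step, while rename() is the one to use when every column should stay.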

Wrap up!

I hope this first lesson has allowed you to see how intuitive and useful the {dplyr} verbs are! This is the first of a series of basic data wrangling verbs: see you in the next lesson to learn more.

Fig: Basic Data Wrangling Dplyr Verbs.
