1 Introduction

Today we will begin our exploration of the {dplyr} package! The first verb on our list is select(), which lets you keep or drop variables (columns) from your data frame. Choosing your variables is often the first step in cleaning your data.

Fig: the select() function.

Let’s go!

2 Learning objectives

You can keep or drop columns from a data frame using the select() function from the {dplyr} package.
You can select a range or combination of columns using operators like the colon (:), the exclamation mark (!), and the c() function.
You can select columns based on patterns in their names with helper functions like starts_with(), ends_with(), contains(), and everything().
You can use rename() and select() to change column names.

3 The Yaounde COVID-19 dataset

In this lesson, we analyse results from a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for IgG and IgM antibodies. The full dataset can be obtained from Zenodo, and the paper can be viewed here.

Spend some time browsing through this dataset. Each line corresponds to one patient surveyed. There are some demographic, socio-economic and COVID-related variables. The results of the IgG and IgM antibody tests are in the columns igg_result and igm_result.

library(readr) # provides read_csv()
library(dplyr) # provides select(), rename() and %>%
yaounde <- read_csv(here::here("data/yaounde_data.csv"))
yaounde

Left: the Yaounde survey team. Right: an antibody test being administered.

4 Introducing `select()`

Fig: the select() function. (Drawing adapted from Allison Horst).

dplyr::select() lets us pick which columns (variables) to keep or drop.

We can select a column by name:

yaounde %>% select(age)

Or we can select a column by position:

yaounde %>% select(3) # `age` is the 3rd column

To select multiple variables, we separate them with commas:

yaounde %>% select(age, sex, igg_result)

## # A tibble: 971 × 3
##      age sex    igg_result
##    <dbl> <chr>  <chr>     
##  1    45 Female Negative  
##  2    55 Male   Positive  
##  3    23 Male   Negative  
##  4    20 Female Positive  
##  5    55 Female Positive  
##  6    17 Female Negative  
##  7    13 Female Positive  
##  8    28 Male   Negative  
##  9    30 Male   Negative  
## 10    13 Female Positive  
## # … with 961 more rows
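To make the name-based and position-based forms concrete, here is a minimal sketch on a tiny made-up tibble. The data frame `toy` and its values are assumptions for illustration only and are not part of the Yaounde data.

```r
library(dplyr)

# Hypothetical toy data standing in for `yaounde` (values are made up)
toy <- tibble(
  id = 1:3,
  sex = c("Female", "Male", "Male"),
  age = c(45, 55, 23),
  igg_result = c("Negative", "Positive", "Negative")
)

by_name <- toy %>% select(age)     # select by name
by_position <- toy %>% select(3)   # `age` is the 3rd column of `toy`

# Both forms return the same one-column tibble
identical(by_name, by_position)

# Columns come back in the order you list them, not in their original order
reordered_names <- toy %>% select(igg_result, age) %>% names()
reordered_names
```

Selecting by name is usually the safer habit: a position such as 3 silently points to a different variable if columns are later added or reordered.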

Select the weight and height variables in the yaounde data frame.
Select the 16th and 22nd columns in the yaounde data frame.

For the next part of the tutorial, let’s create a smaller subset of the data, called yao.

yao <-
  yaounde %>% select(age,
                     sex,
                     highest_education,
                     occupation,
                     is_smoker,
                     is_pregnant,
                     igg_result,
                     igm_result)
yao

4.1 Selecting column ranges with `:`

The : operator selects a range of consecutive variables:

yao %>% select(age:occupation) # Select all columns from `age` to `occupation`

We can also specify a range with column numbers:

yao %>% select(1:4) # Select columns 1 to 4
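The sketch below checks, on a small made-up tibble, that a range written with names and a range written with numbers pick the same columns. The tibble `toy` and its values are hypothetical; only the column names are borrowed from yao.

```r
library(dplyr)

# Hypothetical toy data with the same first columns as `yao` (values are made up)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  highest_education = c("Secondary", "University"),
  occupation = c("Trader", "Student"),
  is_smoker = c("Non-smoker", "Smoker")
)

by_name_range <- toy %>% select(age:occupation)
by_number_range <- toy %>% select(1:4)

names(by_name_range)
identical(by_name_range, by_number_range)

# A range can be combined with further columns, separated by commas
range_plus_one <- toy %>% select(age:sex, is_smoker) %>% names()
range_plus_one
```

Note that a range follows the column order of the data frame, so what age:occupation contains depends on which columns sit between the two.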

With the yaounde data frame, select the columns between symptoms and sequelae, inclusive. (“Inclusive” means you should also include symptoms and sequelae in the selection.)

4.2 Excluding columns with `!`

The exclamation point negates a selection:

yao %>% select(!age) # Select all columns except `age`

To drop a range of consecutive columns, we negate the whole range, for example with !age:occupation:

yao %>% select(!age:occupation) # Drop columns from `age` to `occupation`

To drop several non-consecutive columns, place them inside !c():

yao %>% select(!c(age, sex, igg_result))
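Here is a small sketch of the three negation patterns on made-up data. The tibble `toy` is hypothetical. It also shows why !age:occupation drops the whole range: in R, the : operator is evaluated before !, so the expression is read as !(age:occupation).

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  occupation = c("Trader", "Student"),
  igg_result = c("Negative", "Positive")
)

# Drop a single column
drop_one <- toy %>% select(!age) %>% names()
drop_one

# Drop a range: with or without parentheses, the selection is the same
drop_range <- toy %>% select(!age:occupation)
drop_range_explicit <- toy %>% select(!(age:occupation))
identical(drop_range, drop_range_explicit)

# Drop several non-consecutive columns
drop_several <- toy %>% select(!c(age, igg_result)) %>% names()
drop_several
```

If the version without parentheses feels ambiguous, writing !(age:occupation) makes the intent explicit at no cost.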

From the yaounde data frame, remove all columns between highest_education and consultation, inclusive.

5 Helper functions for `select()`

dplyr has a number of helper functions to make selecting easier by using patterns from the column names. Let’s take a look at some of these.

5.1 `starts_with()` and `ends_with()`

These two helpers work exactly as their names suggest!

yao %>% select(starts_with("is_")) # Columns that start with "is_"

yao %>% select(ends_with("_result")) # Columns that end with "_result"
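One detail worth knowing is that these helpers ignore letter case by default. The sketch below illustrates this on a hypothetical tibble `toy` (made-up values), using the ignore.case argument of starts_with().

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  age = c(45, 55),
  is_smoker = c("Non-smoker", "Smoker"),
  is_pregnant = c("No", "No"),
  igg_result = c("Negative", "Positive"),
  igm_result = c("Negative", "Negative")
)

is_cols <- toy %>% select(starts_with("is_")) %>% names()
result_cols <- toy %>% select(ends_with("_result")) %>% names()
is_cols
result_cols

# Matching ignores case by default, so "IS_" still finds the `is_` columns
upper_default <- toy %>% select(starts_with("IS_")) %>% names()

# With ignore.case = FALSE the match is strict and nothing is selected
upper_strict <- toy %>% select(starts_with("IS_", ignore.case = FALSE)) %>% names()
upper_default
upper_strict
```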

5.2 `contains()`

contains() helps select columns that contain a certain string:

yaounde %>% select(contains("drug")) # Columns that contain the string "drug"
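Unlike starts_with() and ends_with(), contains() matches the string anywhere in the column name. The sketch below shows the difference; the tibble `toy` and its column names (drug_parac, drug_antibio, treatment_drugs) are invented for illustration and should not be read as the actual yaounde columns.

```r
library(dplyr)

# Hypothetical toy data; column names and values are made up
toy <- tibble(
  age = c(45, 55),
  drug_parac = c("Yes", "No"),
  drug_antibio = c("No", "No"),
  treatment_drugs = c("Paracetamol", "None")
)

# contains() finds "drug" at the start, middle or end of a name
with_contains <- toy %>% select(contains("drug")) %>% names()

# starts_with() only finds names that begin with "drug"
with_starts <- toy %>% select(starts_with("drug")) %>% names()

with_contains
with_starts

# Helpers can be mixed with ordinary column names
mixed <- toy %>% select(age, contains("drug")) %>% names()
mixed
```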

5.3 `everything()`

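Another helper function, everything(), matches all variables. Inside select(), a column that has already been selected keeps its earlier position, so everything() in effect adds all the columns that have not yet been selected.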

# First, `is_pregnant`, then every other column.
yao %>% select(is_pregnant, everything())

It is often useful for establishing the order of columns.

Say we wanted to bring the is_pregnant column to the start of the yao data frame. We could type out all the column names manually:

yao %>% select(is_pregnant, 
               age, 
               sex, 
               highest_education, 
               occupation, 
               is_smoker, 
               igg_result, 
               igm_result)

But this would be painful for larger data frames, such as our original yaounde data frame. In such a case, we can use everything():

# Bring `is_pregnant` to the front of the data frame
yaounde %>% select(is_pregnant, everything())

This helper can be combined with many others.

# Bring columns that end with "result" to the front of the data frame
yaounde %>% select(ends_with("result"), everything())
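As a quick check that this pattern only reorders columns and never drops or duplicates them, here is a minimal sketch on a hypothetical tibble `toy` with made-up values.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  is_pregnant = c("No", "No"),
  igg_result = c("Negative", "Positive")
)

# Bring `is_pregnant` to the front
front <- toy %>% select(is_pregnant, everything())
names(front)

# Same number of columns as before: nothing lost, nothing duplicated
ncol(front) == ncol(toy)

# A helper can do the choosing before everything() fills in the rest
result_first <- toy %>% select(ends_with("result"), everything()) %>% names()
result_first
```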

Select all columns in the yaounde data frame that start with “is_”.
Move the columns that start with “is_” to the beginning of the yaounde data frame.

6 Change column names with `rename()`

Fig: the rename() function. (Drawing adapted from Allison Horst)

dplyr::rename() is used to change column names:

# Rename `age` and `sex` to `patient_age` and `patient_sex`
yaounde %>% 
  rename(patient_age = age, 
         patient_sex = sex)

The fact that the new name comes first in the function (rename(NEWNAME = OLDNAME)) is sometimes confusing. You should get used to this with time.

6.1 Rename within `select()`

You can also rename columns while selecting them:

# Select `age` and `sex`, and rename them to `patient_age` and `patient_sex`
yaounde %>% 
  select(patient_age = age, 
         patient_sex = sex)
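The two approaches are not interchangeable: rename() keeps every column, while select() keeps only the columns you list. The sketch below makes the difference visible on a hypothetical tibble `toy` with made-up values.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  igg_result = c("Negative", "Positive")
)

# rename(): NEWNAME = OLDNAME, all columns are kept
renamed <- toy %>%
  rename(patient_age = age,
         patient_sex = sex)
names(renamed)

# select(): same NEWNAME = OLDNAME syntax, but unlisted columns are dropped
selected <- toy %>%
  select(patient_age = age,
         patient_sex = sex)
names(selected)
```

In this sketch, renamed still contains igg_result, whereas selected has only the two renamed columns.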

Wrap Up!

I hope this first lesson has allowed you to see how intuitive and useful the {dplyr} verbs are! This is the first of a series of basic data wrangling verbs: see you in the next lesson to learn more.

Fig: Basic Data Wrangling Dplyr Verbs.

Contributors

The following team members contributed to this lesson:

LAURE VANCAUWENBERGHE
Data analyst, the GRAPH Network
A firm believer in science for good, striving to ally programming, health and education

ANDREE VALLE CAMPOS
R Developer and Instructor, the GRAPH Network
Motivated by reproducible science and education

KENE DAVID NWOSU
Data analyst, the GRAPH Network
Passionate about world improvement

References

Some material in this lesson was adapted from the following sources:

Horst, A. (2021). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Subset columns using their names and types—Select. (n.d.). Retrieved 31 December 2021, from https://dplyr.tidyverse.org/reference/select.html

Artwork was adapted from:

Horst, A. (2021). R & stats illustrations by Allison Horst. https://github.com/allisonhorst/stats-illustrations (Original work published 2018)

Lesson notes | Selecting and renaming columns
