Today we begin our exploration of the {dplyr} package! The first
verb on our list is select(), which lets you keep or drop
variables from your data frame. Choosing your variables is the first step
in cleaning your data.
Fig: the select() function.
Let’s go!
You can keep or drop columns from a data frame using the
select() function from the {dplyr} package.
You can select a range or combination of columns using operators
like the colon (:), the exclamation mark (!),
and the c() function.
You can select columns based on patterns in their names with
helper functions like starts_with(),
ends_with(), contains(), and
everything().
You can use rename() and select() to
change column names.
In this lesson, we analyse results from a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for IgG and IgM antibodies. The full dataset can be obtained from Zenodo, and the paper can be viewed here.
Spend some time browsing through this dataset. Each line corresponds
to one patient surveyed. There are some demographic, socio-economic and
COVID-related variables. The results of the IgG and IgM antibody tests
are in the columns igg_result and
igm_result.
library(tidyverse) # provides read_csv() and the {dplyr} verbs used below
yaounde <- read_csv(here::here("data/yaounde_data.csv"))
yaounde
Left: the Yaounde survey team. Right: an antibody test being administered.
select()
Fig: the select() function. (Drawing adapted from Allison Horst).
dplyr::select() lets us pick which columns (variables)
to keep or drop.
We can select a column by name:
yaounde %>% select(age)
Or we can select a column by position:
yaounde %>% select(3) # `age` is the 3rd column
To select multiple variables, we separate them with commas:
yaounde %>% select(age, sex, igg_result)
## # A tibble: 971 × 3
## age sex igg_result
## <dbl> <chr> <chr>
## 1 45 Female Negative
## 2 55 Male Positive
## 3 23 Male Negative
## 4 20 Female Positive
## 5 55 Female Positive
## 6 17 Female Negative
## 7 13 Female Positive
## 8 28 Male Negative
## 9 30 Male Negative
## 10 13 Female Positive
## # … with 961 more rows
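To try these three forms without the survey file, here is a minimal sketch on a toy data frame. The name `toy` and all of its values are made up for illustration and are not taken from the Yaounde data.

```r
library(dplyr)

# Hypothetical toy data; the values are invented, not from the Yaounde survey
toy <- tibble(
  id = 1:3,
  sex = c("Female", "Male", "Male"),
  age = c(45, 55, 23),
  igg_result = c("Negative", "Positive", "Negative")
)

by_name <- toy %>% select(age)                  # keep one column by name
by_position <- toy %>% select(3)                # `age` is the 3rd column of `toy`
several <- toy %>% select(age, sex, igg_result) # several columns, separated by commas

names(several) # columns come back in the order they were listed
```

Selecting by name is usually the safer choice, since a position such as `3` silently points at a different column if the column order changes.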
Select the weight and height variables in the
yaounde data frame.
Select the 16th and 22nd columns in the yaounde data
frame.
For the next part of the tutorial, let’s create a smaller subset of
the data, called yao.
yao <-
  yaounde %>% select(age,
                     sex,
                     highest_education,
                     occupation,
                     is_smoker,
                     is_pregnant,
                     igg_result,
                     igm_result)
yao
:
The : operator selects a range of consecutive variables:
yao %>% select(age:occupation) # Select all columns from `age` to `occupation`
We can also specify a range with column numbers:
yao %>% select(1:4) # Select columns 1 to 4
In the yaounde data frame, select the columns between symptoms and sequelae, inclusive. (“Inclusive” means you should also include symptoms and sequelae in the selection.)
!
The exclamation point negates a selection:
yao %>% select(!age) # Select all columns except `age`
To drop a range of consecutive columns, we use, for example, !age:occupation:
yao %>% select(!age:occupation) # Drop columns from `age` to `occupation`
To drop several non-consecutive columns, place them inside !c():
yao %>% select(!c(age, sex, igg_result))
From the yaounde data frame, remove all columns between highest_education and consultation, inclusive.
select()
dplyr has a number of helper functions to make selecting easier by using patterns from the column names. Let’s take a look at some of these.
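Before turning to the helpers, the range and negation operators covered so far can be recapped on a toy data frame. This is only a sketch: `toy` and its values are hypothetical, chosen to mimic a few column names from yao.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  occupation = c("Trader", "Student"),
  is_smoker = c("Non-smoker", "Ex-smoker"),
  igg_result = c("Negative", "Positive")
)

range_cols <- toy %>% select(age:occupation)      # age, sex, occupation
drop_one   <- toy %>% select(!age)                # every column except age
drop_range <- toy %>% select(!age:occupation)     # is_smoker, igg_result
drop_some  <- toy %>% select(!c(age, igg_result)) # sex, occupation, is_smoker
```

Note that `:` binds more tightly than `!` in R, so `!age:occupation` is read as "not the range from age to occupation".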
starts_with() and ends_with()
These two helpers work exactly as their names suggest!
yao %>% select(starts_with("is_")) # Columns that start with "is_"
yao %>% select(ends_with("_result")) # Columns that end with "_result"
contains()
contains() helps select columns that contain a certain string:
yaounde %>% select(contains("drug")) # Columns that contain the string "drug"
everything()
Another helper function, everything(), matches all variables. Inside select(), a column that was already listed keeps its earlier position, so everything() effectively adds all the columns that have not yet been selected.
# First, `is_pregnant`, then every other column.
yao %>% select(is_pregnant, everything())
It is often useful for establishing the order of columns.
Say we wanted to bring the is_pregnant column to the start of the yao data frame. We could type out all the column names manually:
yao %>% select(is_pregnant,
               age,
               sex,
               highest_education,
               occupation,
               is_smoker,
               igg_result,
               igm_result)
But this would be painful for larger data frames, such as our original yaounde data frame. In such a case, we can use everything():
# Bring `is_pregnant` to the front of the data frame
yaounde %>% select(is_pregnant, everything())
This helper can be combined with many others.
# Bring columns that end with "result" to the front of the data frame
yaounde %>% select(ends_with("result"), everything())
Select all columns in the yaounde data frame that start with “is_”.
Move the columns that start with “is_” to the beginning of the
yaounde data frame.
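The helper functions in this section can also be tried on a small made-up data frame. In the sketch below, `toy`, its values, and the column name `drug_parac` are assumptions for illustration only; just the helper functions themselves come from the lesson.

```r
library(dplyr)

# Hypothetical toy data; column names and values are invented for illustration
toy <- tibble(
  age = c(45, 55),
  is_smoker = c("Non-smoker", "Ex-smoker"),
  is_pregnant = c("No", NA),
  drug_parac = c("No", "Yes"),
  igg_result = c("Negative", "Positive"),
  igm_result = c("Negative", "Negative")
)

starts    <- toy %>% select(starts_with("is_"))    # is_smoker, is_pregnant
ends      <- toy %>% select(ends_with("_result"))  # igg_result, igm_result
has_drug  <- toy %>% select(contains("drug"))      # drug_parac
reordered <- toy %>% select(ends_with("_result"), everything())

names(reordered) # result columns first, then the remaining columns in their original order
```

Because `everything()` leaves already-selected columns where they are, the last call reorders the data frame without dropping or duplicating anything.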
rename()
Fig: the rename() function. (Drawing adapted from Allison Horst)
dplyr::rename() is used to change column names:
# Rename `age` and `sex` to `patient_age` and `patient_sex`
yaounde %>%
  rename(patient_age = age,
         patient_sex = sex)
The fact that the new name comes first in the function (rename(NEWNAME = OLDNAME)) is sometimes confusing. You should get used to this with time.
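One way to get comfortable with the NEWNAME = OLDNAME order is to check the column names before and after the call. The sketch below uses a hypothetical `toy` data frame with invented values.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  igg_result = c("Negative", "Positive")
)

# New name on the left, existing name on the right
renamed <- toy %>%
  rename(patient_age = age,
         patient_sex = sex)

names(toy)     # "age" "sex" "igg_result"
names(renamed) # "patient_age" "patient_sex" "igg_result"
```

rename() keeps every column and leaves the column order unchanged; only the listed names change.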
select()
You can also rename columns while selecting them:
# Select `age` and `sex`, and rename them to `patient_age` and `patient_sex`
yaounde %>%
  select(patient_age = age,
         patient_sex = sex)
I hope this first lesson has allowed you to see how intuitive and useful the {dplyr} verbs are! This is the first of a series of basic data wrangling verbs: see you in the next lesson to learn more.
Fig: Basic Data Wrangling Dplyr Verbs.
Some material in this lesson was adapted from the following sources:
Horst, A. (2021). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Subset columns using their names and types—Select. (n.d.). Retrieved 31 December 2021, from https://dplyr.tidyverse.org/reference/select.html