Today we will begin our exploration of the {dplyr} package! Our first verb on the list is select(), which allows you to keep or drop variables from your data frame. Choosing your variables is the first step in cleaning your data.
Let’s go!
You can keep or drop columns from a data frame using the dplyr::select() function from the {dplyr} package.
You can select a range or combination of columns using operators like the colon (:), the exclamation mark (!), and the c() function.
You can select columns based on patterns in their names with helper functions like starts_with(), ends_with(), contains(), and everything().
You can use rename() and select() to change column names.
In this lesson, we analyse results from a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for IgG and IgM antibodies. The full dataset can be obtained from Zenodo, and the paper can be viewed here.
Spend some time browsing through this dataset. Each line corresponds to one patient surveyed. There are some demographic, socio-economic and COVID-related variables. The results of the IgG and IgM antibody tests are in the columns igg_result and igm_result.
yaounde <- read_csv(here::here("data/yaounde_data.csv"))
yaounde
select()
dplyr::select() lets us pick which columns (variables) to keep or drop.
We can select a column by name:
yaounde %>% select(age)
Or we can select a column by position:
yaounde %>% select(3) # `age` is the 3rd column
To select multiple variables, we separate them with commas:
yaounde %>% select(age, sex, igg_result)
## # A tibble: 971 × 3
## age sex igg_result
## <dbl> <chr> <chr>
## 1 45 Female Negative
## 2 55 Male Positive
## 3 23 Male Negative
## 4 20 Female Positive
## 5 55 Female Positive
## 6 17 Female Negative
## 7 13 Female Positive
## 8 28 Male Negative
## 9 30 Male Negative
## 10 13 Female Positive
## # … with 961 more rows
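To see that selecting by name and selecting by position pick out the same columns, here is a minimal sketch on a small made-up tibble. The object `toy` and all of its values are hypothetical and are not taken from the survey data.

```r
library(dplyr)

# Hypothetical toy data, for illustration only (not the Yaounde survey)
toy <- tibble(
  id = 1:3,
  sex = c("Female", "Male", "Male"),
  age = c(45, 55, 23),
  igg_result = c("Negative", "Positive", "Negative")
)

# Select the same two columns, first by name and then by position
by_name <- toy %>% select(age, igg_result)
by_position <- toy %>% select(3, 4)

# Compare the column names returned by each approach
names(by_name)
names(by_position)
```

Selecting by name is usually the safer habit: positions shift whenever columns are added, removed or reordered, while names stay the same.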
Select the weight and height variables in the yaounde data frame.
Select the 16th and 22nd columns in the yaounde data frame.
For the next part of the tutorial, let’s create a smaller subset of the data, called yao.
yao <-
  yaounde %>% select(age,
                     sex,
                     highest_education,
                     occupation,
                     is_smoker,
                     is_pregnant,
                     igg_result,
                     igm_result)
yao
:
The : operator selects a range of consecutive variables:
yao %>% select(age:occupation) # Select all columns from `age` to `occupation`
We can also specify a range with column numbers:
yao %>% select(1:4) # Select columns 1 to 4
In the yaounde data frame, select the columns between symptoms and sequelae, inclusive. (“Inclusive” means you should also include symptoms and sequelae in the selection.)
!
The exclamation point negates a selection:
yao %>% select(!age) # Select all columns except `age`
To drop a range of consecutive columns, we use, for example, !age:occupation:
yao %>% select(!age:occupation) # Drop columns from `age` to `occupation`
To drop several non-consecutive columns, place them inside !c():
yao %>% select(!c(age, sex, igg_result))
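The range and negation operators can be compared side by side on a small example. This is a sketch on a made-up tibble; `toy` and its values are hypothetical, and `names()` is used only to show which columns survive each call.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  occupation = c("Trader", "Student"),
  is_smoker = c("Non-smoker", "Smoker"),
  igg_result = c("Negative", "Positive")
)

# Keep the consecutive range from `age` to `occupation`
kept_range <- toy %>% select(age:occupation)
names(kept_range)

# Drop that same range; R evaluates `:` before `!`, so this means !(age:occupation)
dropped_range <- toy %>% select(!age:occupation)
names(dropped_range)

# Drop two non-consecutive columns
dropped_set <- toy %>% select(!c(age, igg_result))
names(dropped_set)
```

Writing the range in parentheses, as in `!(age:occupation)`, gives the same result and may be easier to read.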
From the yaounde data frame, remove all columns between highest_education and consultation, inclusive.
select()
dplyr has a number of helper functions to make selecting easier by using patterns from the column names. Let’s take a look at some of these.
starts_with() and ends_with()
These two helpers work exactly as their names suggest!
yao %>% select(starts_with("is_")) # Columns that start with "is_"
yao %>% select(ends_with("_result")) # Columns that end with "_result"
contains()
contains() helps select columns that contain a certain string:
yaounde %>% select(contains("drug")) # Columns that contain the string "drug"
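The three pattern helpers can be tried out on a small made-up tibble. In this sketch, `toy` and its column names (including `drug_a` and `other_drug`) are hypothetical and chosen only to give each helper something to match.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  age = c(45, 55),
  is_smoker = c("Non-smoker", "Smoker"),
  is_pregnant = c("No", "No"),
  drug_a = c("Yes", "No"),
  other_drug = c("No", "No"),
  igg_result = c("Negative", "Positive"),
  igm_result = c("Negative", "Negative")
)

# Each helper matches on a different part of the column name
prefix_cols <- toy %>% select(starts_with("is_"))
suffix_cols <- toy %>% select(ends_with("_result"))
substring_cols <- toy %>% select(contains("drug"))

names(prefix_cols)
names(suffix_cols)
names(substring_cols)

# Helpers can be combined inside c(), just like column names
combined_cols <- toy %>% select(c(starts_with("is_"), ends_with("_result")))
names(combined_cols)
```

By default these helpers ignore case when matching, so `starts_with("IS_")` would match the same columns here.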
everything()
Another helper function, everything(), matches all variables that have not yet been selected.
# First, `is_pregnant`, then every other column.
yao %>% select(is_pregnant, everything())
It is often useful for establishing the order of columns.
Say we wanted to bring the is_pregnant column to the start of the yao data frame. We could type out all the column names manually:
yao %>% select(is_pregnant,
               age,
               sex,
               highest_education,
               occupation,
               is_smoker,
               igg_result, igm_result)
But this would be painful for larger data frames, such as our original yaounde data frame. In such a case, we can use everything():
# Bring `is_pregnant` to the front of the data frame
yaounde %>% select(is_pregnant, everything())
This helper can be combined with many others.
# Bring columns that end with "result" to the front of the data frame
yaounde %>% select(ends_with("result"), everything())
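To make the effect on column order explicit, the sketch below reorders a small made-up tibble and prints the names before and after. The object `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  igg_result = c("Negative", "Positive"),
  igm_result = c("Negative", "Negative")
)

# Original column order
names(toy)

# Result columns first, then every column not already selected
reordered <- toy %>% select(ends_with("result"), everything())
names(reordered)
```

No column is lost or duplicated: everything() only adds the columns that the earlier arguments did not already pick up.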
Select all columns in the yaounde data frame that start with “is_”.
Move the columns that start with “is_” to the beginning of the yaounde data frame.
rename()
dplyr::rename() is used to change column names:
# Rename `age` and `sex` to `patient_age` and `patient_sex`
yaounde %>%
  rename(patient_age = age,
         patient_sex = sex)
The fact that the new name comes first in the function (rename(NEWNAME = OLDNAME)) is sometimes confusing. You should get used to this with time.
select()
You can also rename columns while selecting them:
# Select `age` and `sex`, and rename them to `patient_age` and `patient_sex`
yaounde %>%
  select(patient_age = age,
         patient_sex = sex)
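The practical difference between the two approaches is which columns remain afterwards. Here is a minimal sketch on a made-up tibble (`toy` and its values are hypothetical) that applies the same renaming with each function:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  age = c(45, 55),
  sex = c("Female", "Male"),
  igg_result = c("Negative", "Positive")
)

# rename() keeps every column and only changes the names given
renamed <- toy %>% rename(patient_age = age, patient_sex = sex)
names(renamed)

# select() keeps only the columns listed, under their new names
selected <- toy %>% select(patient_age = age, patient_sex = sex)
names(selected)
```

Use rename() when you want to keep the whole data frame, and select() when you are subsetting the columns anyway.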
I hope this first lesson has allowed you to see how intuitive and useful the {dplyr} verbs are! This is the first of a series of basic data wrangling verbs: see you in the next lesson to learn more.
Some material in this lesson was adapted from the following sources:
Horst, A. (2021). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Subset columns using their names and types—Select. (n.d.). Retrieved 31 December 2021, from https://dplyr.tidyverse.org/reference/select.html