In the last lesson, you learned the basics of data transformation
using the {dplyr} function mutate()
.
In that lesson, we mostly looked at global transformations; that is, transformations that did the same thing to an entire variable. In this lesson, we will look at how to conditionally manipulate certain rows based on whether or not they meet defined criteria.
For this, we will mostly use the case_when()
function,
which you will likely come to see as one of the most important functions
in {dplyr} for data wrangling tasks.
Let’s get started.
You can transform or create new variables based on conditions
using dplyr::case_when()
You know how to use the TRUE
condition in
case_when()
to match unmatched cases.
You can handle NA
values in case_when()
transformations.
You understand how to keep the default values of a variable in a
case_when()
formula
You can write case_when()
conditions involving
multiple comparators and multiple variables.
You understand case_when()
conditions priority
order.
You can use dplyr::if_else()
for binary conditional
assignment.
This lesson will require the tidyverse suite of packages:
if(!require(pacman)) install.packages("pacman")
::p_load(tidyverse) pacman
In this lesson, we will again use data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
# Import and view the dataset
<-
yaounde read_csv(here::here('data/yaounde_data.csv')) %>%
## make every 5th age missing
mutate(age = case_when(row_number() %in% seq(5, 900, by = 5) ~ NA_real_,
TRUE ~ age)) %>%
## rename the age variable
rename(age_years = age) %>%
# drop the age category column
select(-age_category)
yaounde
Note that in the code chunk above, we slightly modified the age
column, artificially introducing some missing values, and we also
dropped the age_category
column. This is to help illustrate
some key points in the tutorial.
For practice questions, we will also use an outbreak linelist of 136 cases of influenza A H7N9 from a 2013 outbreak in China. This is a modified version of a dataset compiled by Kucharski et al. (2014).
# Import and view the dataset
<- read_csv(here::here('data/flu_h7n9_china_2013.csv'))
flu_linelist flu_linelist
Throughout this lesson, you will use a lot of relational operators in
R. Recall that relational operators, sometimes called “comparators”,
test the relation between two values, and return TRUE
,
FALSE
or NA
.
A list of the most common operators is given below:
Operator | is TRUE if |
A < B | A is less than B |
A <= B | A is less than or equal to B |
A > B | A is greater than B |
A >= B | A is greater than or equal to B |
A == B | A is equal to B |
A != B | A is not equal to B |
A %in% B | A is an element of B |
case_when()
To get familiar with case_when(),
let’s begin with a
simple conditional transformation on the age_years
column
of the yaounde
dataset. First we subset the data frame to
just the age_years
column for easy illustration:
<-
yaounde_age %>%
yaounde select(age_years)
yaounde_age
Now, using case_when()
, we can make a new column, called
“age_group”, that has the value “Child” if the person is below 18, and
“Adult” if the person is 18 and up:
%>%
yaounde_age mutate(age_group = case_when(age_years < 18 ~ "Child",
>= 18 ~ "Adult")) age_years
The case_when()
syntax may seem a bit foreign, but it is
quite simple: on the left-hand side (LHS) of the ~
sign
(called a “tilde”), you provide the condition(s) you want to evaluate,
and on the right-hand side (RHS), you provide a value to put in if the
condition is true.
So the statement
case_when(age_years < 18 ~ "Child", age_years >= 18 ~ "Adult")
can be read as: “if age_years
is below 18, input ‘Child’,
else if age_years
is greater than or equal to 18, input
‘Adult’”.
Formulas, LHS and RHS
Each line of a case_when()
call is termed a “formula”
or, sometimes, a “two-sided formula”. And each formula has a left-hand
side (abbreviated LHS) and right-hand side (abbreviated RHS).
For example, the code age_years < 18 ~ "Child"
is a
“formula”, its LHS is age_years < 18
while its RHS is
"Child"
.
You are likely to come across these terms when reading the
documentation for the case_when()
function, and we will
also refer to them in this lesson.
After creating a new variable with case_when()
, it is a
good idea to inspect it thoroughly to make sure it worked as
intended.
To inspect the variable, you can pipe your data frame into the
View()
function to view it in spreadsheet form:
%>%
yaounde_age mutate(age_group = case_when(age_years < 18 ~ "Child",
>= 18 ~ "Adult")) %>%
age_years View()
This would open up a new tab in RStudio where you should manually
scan through the new column, age_group
and the referenced
column age_years
to make sure your case_when()
statement did what you wanted it to do.
You could also pass the new column into the tabyl()
function to ensure that the proportions “make sense”:
%>%
yaounde_age mutate(age_group = case_when(age_years < 18 ~ "Child",
>= 18 ~ "Adult")) %>%
age_years tabyl(age_group)
With the flu_linelist
data, make a new column, called
age_group
, that has the value “Below 50” for people under
50 and “50 and above” for people aged 50 and up. Use the
case_when()
function.
# Complete the code with your answer:
<-
Q_age_group %>%
flu_linelist mutate(age_group = ______________________________)
Out of the entire sample of individuals in the
flu_linelist
dataset, what percentage are confirmed to be
below 60? (Repeat the above procedure but with the 60
cutoff, then call tabyl()
on the age group variable. Use
the percent
column, not the valid_percent
column.)
# Enter your answer as a number without quotes:
<- YOUR_ANSWER_HERE Q_age_group_percentage
TRUE
default argumentIn a case_when()
statement, you can use a literal
TRUE
condition to match any rows not yet matched with
provided conditions.
For example, if we only keep only the first condition from the
previous example, age_years < 18
, and define the default
value to be TRUE ~ "Not child"
then all adults and
NA
values in the data set will be labeled
"Not child"
by default.
%>%
yaounde_age mutate(age_group = case_when(age_years < 18 ~ "Child",
TRUE ~ "Not child"))
This TRUE
condition can be read as “for everything
else…”.
So the full case_when()
statement used above,
age_years < 18 ~ "Child", TRUE ~ "Not child"
, would then
be read as: “if age is below 18, input ‘Child’ and for everyone else
not yet matched, input ‘Not child’”.
It is important to use TRUE
as the final
condition in case_when()
. If you use it as the first
condition, it will take precedence over all others, as seen here:
%>%
yaounde_age mutate(age_group = case_when(TRUE ~ "Not child",
< 18 ~ "Child")) age_years
As you can observe, all individuals are now coded with “Not child”,
because the TRUE
condition was placed first, and therefore
took precedence. We will explore the issue of precedence further
below.
is.na()
We can match missing values manually with is.na()
. Below
we match NA
ages with is.na()
and set their
age group to “Missing age”:
%>%
yaounde_age mutate(age_group = case_when(age_years < 18 ~ "Child",
>= 18 ~ "Adult",
age_years is.na(age_years) ~ "Missing age"))
As before, using the flu_linelist
data, make a new
column, called age_group
, that has the value “Below 60” for
people under 60 and “60 and above” for people aged 60 and up. But this
time, also set those with missing ages to “Missing age”.
# Complete the code with your answer:
<-
Q_age_group_nas %>% flu_linelist
The gender
column of the flu_linelist
dataset contains the values “f”, “m” and NA
:
%>%
flu_linelist tabyl(gender)
Recode “f”, “m” and NA
to “Female”, “Male” and “Missing
gender” respectively. You should modify the existing gender
column, not create a new column.
# Complete the code with your answer:
<-
Q_gender_recode %>% flu_linelist
The right-hand side (RHS) of a case_when()
formula can
also take in a variable from your data frame. This is often useful when
you want to change just a few values in a column.
Let’s see an example with the highest_education
column,
which contains the highest education level attained by a respondent:
<-
yaounde_educ %>%
yaounde select(highest_education)
yaounde_educ
Below, we create a new column, highest_educ_recode
,
where we recode both “University” and “Doctorate” to the value
“Post-secondary”:
%>%
yaounde_educ mutate(highest_educ_recode =
case_when(
%in% c("University", "Doctorate") ~ "Post-secondary"
highest_education ))
It worked, but now we have NA
s for all other rows. To
keep these other rows at their default values, we can add the line
TRUE ~ highest_education
(with a variable,
highest_education
, on the right-hand side of a
formula):
%>%
yaounde_educ mutate(highest_educ_recode =
case_when(
%in% c("University", "Doctorate") ~ "Post-secondary",
highest_education TRUE ~ highest_education
))
Now the case_when()
statement reads: ‘If highest
education is “University” or “Doctorate”, input “Post-secondary”. For
everyone else, input the value from highest_education
’.
Above we have been putting the recoded values in a separate column,
highest_educ_recode
, but for this kind of replacement, it
is more common to simply overwrite the existing column:
%>%
yaounde_educ mutate(highest_education =
case_when(
%in% c("University", "Doctorate") ~ "Post-secondary",
highest_education TRUE ~ highest_education
))
We can read this last case_when()
statement as: ‘If
highest education is “University” or “Doctorate”, change the value
to “Post-secondary”. For everyone else, leave in the value
from highest_education
’.
Using the flu_linelist
data, modify the existing column
outcome
by replacing the value “Recover” with
“Recovery”.
# Complete the code with your answer:
<-
Q_recode_recovery flu_linelist
(We know it’s a lot of code for such a simple change. Later you will see easier ways to do this.)
Avoiding long code lines As you start to write
increasingly complex case_when()
statements, it will become
helpful to use line breaks to avoid long lines of code.
To assist with creating line breaks, you can use the {styler}
package. Install it with pacman::p_load(styler)
. Then to
reformat any piece of code, highlight the code, click the “Addins”
button in RStudio, then click on “Style selection”:
Alternatively, you could highlight the code and use the shortcut
Shift
+ Command/Control
+ A
to
use RStudio’s built-in code reformatter.
Sometimes {styler} does a better job at reformatting. Sometimes the built-in reformatter does a better job.
LHS conditions in case_when()
formulas can have multiple
parts. Let’s see an example of this.
But first, we will inspire ourselves from what we learnt in the
mutate()
lesson and recreate the BMI variable. This
involves first converting the height_cm
variable to meters,
then calculating BMI.
<-
yaounde_BMI %>%
yaounde mutate(height_m = height_cm/100,
BMI = (weight_kg / (height_m)^2)) %>%
select(BMI)
yaounde_BMI
Recall the following BMI categories:
If the BMI is inferior to 18.5, the person is considered underweight.
A normal BMI is greater than or equal to 18.5 and less than 25.
An overweight BMI is greater than or equal to 25 and less than 30.
An obese BMI is BMI is greater than or equal to 30.
The condition BMI >= 18.5 & BMI < 25
to define
Normal weight
is a compound condition because it has
two comparators: >=
and <
.
<-
yaounde_BMI %>%
yaounde_BMI mutate(BMI_classification = case_when(BMI < 18.5 ~'Underweight',
>= 18.5 & BMI < 25 ~ 'Normal weight',
BMI >= 25 & BMI < 30 ~ 'Overweight',
BMI >= 30 ~ 'Obese'))
BMI yaounde_BMI
Let’s use tabyl()
to have a look at our data:
%>%
yaounde_BMI tabyl(BMI_classification)
But you can see that the levels of BMI are defined in alphabetical
order from Normal weight to Underweight, instead of from lightest
(Underweight) to heaviest (Obese). Remember that if you want to have a
certain order you can make BMI_classification
a factor
using mutate()
and define its levels.
%>%
yaounde_BMI mutate(BMI_classification = factor(BMI_classification, levels=c("Obese",
"Overweight",
"Normal weight",
"Underweight"))) %>%
tabyl(BMI_classification)
With compound conditions, you should remember to input the variable name everytime there is a comparator. R learners often forget this and will try to run code that looks like this:
%>%
yaounde_BMI mutate(BMI_classification = case_when(BMI < 18.5 ~'Underweight',
>= 18.5 & < 25 ~ 'Normal weight',
BMI >= 25 & < 30 ~ 'Overweight',
BMI >= 30 ~ 'Obese')) BMI
The definitions for the “Normal weight” and “Overweight” categories are mistaken. Do you see the problem? Try to run the code to spot the error.
With the flu_linelist
data, make a new column, called
adolescent
, that has the value “Yes” for people in the
10-19 (at least 10 and less than 20) age group, and “No” for everyone
else.
# Complete the code with your answer:
<-
Q_adolescent_grouping %>% flu_linelist
In all examples seen so far, you have only used conditions involving a single variable at a time. But LHS conditions often refer to multiple variables at once.
Let’s see a simple example with age and sex in the
yaounde
data frame. First, we select just these two
variables for easy illustration:
<-
yaounde_age_sex %>%
yaounde select(age_years, sex)
yaounde_age_sex
Now, imagine we want to recruit women and men in the 20-29 age group
into two studies. For this we’d like to create a column, called
recruit
, with the following schema:
To do this, we run the following case_when statement:
%>%
yaounde_age_sex mutate(recruit = case_when(
== "Female" & age_years >= 20 & age_years <= 29 ~ "Recruit to female study",
sex == "Male" & age_years >= 20 & age_years <= 29 ~ "Recruit to male study",
sex TRUE ~ "Do not recruit"
))
You could also add extra pairs of parentheses around the age criteria within each condition:
%>%
yaounde_age_sex mutate(recruit = case_when(
== "Female" & (age_years >= 20 & age_years <= 29) ~ "Recruit to female study",
sex == "Male" & (age_years >= 20 & age_years <= 29) ~ "Recruit to male study",
sex TRUE ~ "Do not recruit"
))
This extra pair of parentheses does not change the code output, but
it improves coherence because the reader can visually see that your
condition is made of two parts, one for gender,
sex == "Female"
, and another for age,
(age_years >= 20 & age_years <= 29)
.
With the flu_linelist
data, make a new column, called
recruit
with the following schema:
# Complete the code with your answer:
<-
Q_age_province_grouping %>%
flu_linelist mutate(recruit = ______________________________)
case_when()
Note that the order of conditions is important, because conditions
listed at the top of your case_when()
statement take
priority over others.
To understand this, run the example below:
%>%
yaounde_age_sex mutate(age_group = case_when(age_years < 18 ~ "Child",
< 30 ~ "Young adult",
age_years < 120 ~ "Older adult")) age_years
This initially looks like a faulty case_when()
statement
because the age conditions overlap. For example, the statement
age_years < 120 ~ "Older adult"
(which reads “if age is
below 120, input ‘Older adult’”) suggests that anyone between
ages 0 and 120 (even a 1-year old baby!, would be coded as “Older
adult”.
But as you saw, the code actually works fine! People under 18 are still coded as “Child”.
What’s going on? Essentially, the case_when()
statement
is interpreted as a series of branching logical steps, starting with the
first condition. So this particular statement can be read as: “If age is
below 18, input ‘Child’, and otherwise, if age is below 30,
input ‘Young adult’, and otherwise, if age is below 120,
input”Older adult”.
This is illustrated in the schematic below:
This means that if you swap the order of the conditions, you will end
up with a faulty case_when()
statement:
%>%
yaounde_age mutate(age_group = case_when(age_years < 120 ~ "Older adult",
< 30 ~ "Young adult",
age_years < 18 ~ "Child")) age_years
As you can see, everyone is coded as “Older adult”. This happens because the first condition matches everyone, so there is no one left to match with the subsequent conditions. The statement can be read “If age is below 120, input ‘Older adult’, and otherwise if age is below 30….” But there is no “otherwise” because everyone has already been matched!
This is illustrated in the diagram below:
Although we have spent much time explaining the importance of the order of conditions, in this specific example, there would be a much clearer way to write this code that would not depend on the order of conditions. Rather than leave the age groups open-ended like this:
age_years < 120 ~ "Older adult"
you should actually use closed age bounds like this:
age_years >= 30 & age_years < 120 ~ "Older adult"
which is read: “if age is greater than or equal to 30 and less than 120, input ‘Older adult’”.
With such closed conditions, the order of conditions no longer matters. You get the same result no matter how you arrange the conditions:
# start with "Older adult" condition
%>%
yaounde_age mutate(age_group = case_when(age_years >= 30 & age_years < 120 ~ "Older adult",
>= 18 & age_years < 30 ~ "Young adult",
age_years >= 0 & age_years < 18 ~ "Child")) age_years
# start with "Child" condition
%>%
yaounde_age mutate(age_group = case_when(age_years >= 0 & age_years < 18 ~ "Child",
>= 18 & age_years < 30 ~ "Young adult",
age_years >= 30 & age_years < 120 ~ "Older adult")) age_years
Nice and clean!
So why did we spend so much time explaining the importance of condition order if you can simply avoid open-ended categories and not have to worry about condition order?
One reason is that understanding condition order should now help you
see why it is important to put the TRUE
condition as the
final line in your case_when()
statement. The
TRUE
condition matches every row that has not yet been
matched, so if you use it first in the case_when()
,
it will match everyone!
The other reason is that there are certain cases where you
may want to use open-ended overlapping conditions, and so you
will have to pay attention to the order of conditions. Let’s see one
such example now: identifying COVID-like symptoms. Note that this is
somewhat advanced material, likely a bit above your current needs. We
are introducing it now so you are aware and can stay vigilant with
case_when()
in the future.
case_when()
We want to identify COVID-like symptoms in our data. Consider the
symptoms columns in the yaounde
data frame, which indicates
which symptoms were experienced by respondents over a 6-month
period:
%>%
yaounde select(starts_with("symp_"))
We would like to use this to assess whether a person may have had COVID, partly following guidelines recommended by the WHO.
Now, keeping these criteria in mind, consider an individual, let’s call her Osma, who has cough AND anosmia/ageusia? How should we classify Osma?
She meets the criteria for “possible COVID” (because she has cough), but she also meets the criteria for “probable COVID” (because she has anosmia/ageusia). So which group should she be classed as, “possible COVID” or “probable COVID”? Think about it for a minute.
Hopefully you guessed that she should be classed as a “probable COVID case”. “Probable” is more likely than “Possible”; and the anosmia/ageusia symptom is more significant than the cough symptom. One might say that the criterion for “probable COVID” has a higher specificity or a higher precedence than the criterion for “possible COVID”.
Therefore, when constructing a case_when()
statement,
the “probable COVID” condition should also take higher precedence—it
should come first in the conditions provided to
case_when()
. Let’s see this now.
First we select the relevant variables, for easy illustration. We
also identify and slice()
specific rows that are useful for
the demonstration:
<-
yaounde_symptoms_slice %>%
yaounde select(symp_cough, symp_anosmia_or_ageusia) %>%
# slice of specific rows useful for demo
# Once you find the right code, you would remove this slice
slice(32, 711, 625, 651 )
yaounde_symptoms_slice
Now, the correct case_when()
statement, which has the
“Probable COVID” condition first:
%>%
yaounde_symptoms_slice mutate(covid_status = case_when(
== "Yes" ~ "Probable COVID",
symp_anosmia_or_ageusia == "Yes" ~ "Possible COVID"
symp_cough ))
This case_when()
statement can be read in simple terms
as ‘If the person has anosmia/ageusia, input “Probable COVID”, and
otherwise, if the person has cough, input “Possible COVID”’.
Now, spend some time looking through the output data frame,
especially the last three individuals. The individual in row 2 meets the
criterion for “Possible COVID” because they have cough
(symp_cough
== “Yes”), and the individual in row 3 meets
the criterion for “Probable COVID” because they have anosmia/ageusia
(symp_anosmia_or_ageusia == "Yes"
).
The individual in row 4 is Osma, who both meets the criteria for
“possible COVID” and for “probable COVID”. And because we
arranged our case_when()
conditions in the right order, she
is coded correctly as “probable COVID”. Great!
But notice what happens if we swap the order of the conditions:
%>%
yaounde_symptoms_slice mutate(covid_status = case_when(
== "Yes" ~ "Possible COVID",
symp_cough == "Yes" ~ "Probable COVID"
symp_anosmia_or_ageusia ))
Oh no! Osma in row 4 is now misclassed as “Possible COVID” even
though she has the more significant anosmia/ageusia symptom. This is
because the first condition symp_cough == "Yes"
matched her
first, and so the second condition was not able to match her!
So now you see why you sometimes need to think deeply about the order
of your case_when()
conditions. It is a minor point, but it
can bite you at unexpected times. Even experienced analysts tend to make
mistakes that can be traced to improper arrangement of
case_when()
statements.
In reality, there is still another solution to avoid
misclassifying the person with cough and anosmia/ageusia. That is to add
symp_anosmia_or_ageusia != "Yes"
(not equal to “Yes”) to
the conditions for “Possible COVID”. Can you think of why this
works?
%>%
yaounde_symptoms_slice mutate(covid_status = case_when(
== "Yes" & symp_anosmia_or_ageusia != "Yes" ~ "Possible COVID",
symp_cough == "Yes" ~ "Probable COVID")) symp_anosmia_or_ageusia
With the flu_linelist
dataset, create a new column
called follow_up_priority
that implements the following
schema:
# Complete the code with your answer:
<-
Q_priority_groups %>%
flu_linelist mutate(follow_up_priority = ________________
)
dplyr::if_else()
There is another {dplyr} verb similar to case_when()
for
when we want to apply a binary condition to a variable:
if_else()
. A binary condition is either TRUE
or FALSE
.
if_else()
has a similar application as
case_when()
: if the condition is true, then one operation
is applied, if the condition is false, the alternative is applied. The
syntax is: if_else(CONDITION, IF_TRUE, IF_FALSE)
. As you
can see, this only allows for a binary condition (not multiple cases,
such as handled by case_when()
).
If we take one of the first examples about recoding the
highest_education
variable, we can write it either with
case_when()
or with if_else()
.
Here is the version we already explored:
%>%
yaounde_educ mutate(highest_education =
case_when(
%in% c("University", "Doctorate") ~ "Post-secondary",
highest_education TRUE ~ highest_education
))
And this is how we would write it using if_else()
:
%>%
yaounde_educ mutate(highest_education =
if_else(
%in% c("University", "Doctorate"),
highest_education # if TRUE then we recode
"Post-secondary",
# if FALSE then we keep default value
highest_education ))
As you can see, we get the same output, whether we use
if_else()
or case_when()
.
With the flu_linelist
data, make a new column, called
age_group
, that has the value “Below 50” for people under
50 and “50 and above” for people aged 50 and up. Use the
if_else()
function.
This is exactly the same question as your first practice question,
but this time you need to use if_else()
.
# Complete the code with your answer:
<-
Q_age_group_if_else %>%
flu_linelist mutate(age_group = if_else(______________________________))
Changing or constructing your variables based on conditions on other variables is one of the most repeated data wrangling tasks. To the point it deserved its very own lesson !
I hope now that you will feel comfortable using
case_when()
and if_else()
within
mutate()
and that you are excited to learn more complex
{dplyr} operations such as grouping variables and summarizing them.
See you next time!
The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources:
Horst, A. (2022). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Create, modify, and delete columns — Mutate. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/mutate.html
Artwork was adapted from: