You now know how to keep or drop columns and rows from your dataset.
Today you will learn how to modify existing variables or create new
ones, using the mutate()
verb from {dplyr}. This is an
essential step in most data analysis projects.
Let’s go!
You can use the mutate()
function from the
{dplyr}
package to create new variables or modify existing
variables.
You can create new numeric, character, factor, and boolean variables
This lesson will require the packages loaded below:
if(!require(pacman)) install.packages("pacman")
::p_load(here,
pacman
janitor, tidyverse)
In this lesson, we will again use the data from the COVID-19
serological survey conducted in Yaounde, Cameroon. Below, we import the
dataset yaounde
and create a smaller subset called
yao
. Note that this dataset is slightly different from the
one used in the previous lesson.
<- read_csv(here::here('data/yaounde_data.csv'))
yaounde
## a smaller subset of variables
<- yaounde %>% select(date_surveyed,
yao
age,
weight_kg, height_cm,
symptoms, is_smoker)
yao
We will also use a dataset from a cross-sectional study that aimed to determine the prevalence of sarcopenia in the elderly population (>60 years) in in Karnataka, India. Sarcopenia is a condition that is common in elderly people and is characterized by progressive and generalized loss of skeletal muscle mass and strength. The data was obtained from Zenodo here, and the source publication can be found here.
Below, we import and view this dataset:
<- read_csv(here::here('data/sarcopenia_elderly.csv'))
sarcopenia
sarcopenia
mutate()
We use dplyr::mutate()
to create new variables or modify
existing variables. The syntax is quite intuitive, and generally looks
like
df %>% mutate(new_column_name = what_it_contains)
.
Let’s see a quick example.
The yaounde
dataset currently contains a column called
height_cm
, which shows the height, in centimeters, of
survey respondents. Let’s create a data frame, yao_height
,
with just this column, for easy illustration:
<- yaounde %>% select(height_cm)
yao_height yao_height
What if you wanted to create a new variable, called
height_meters
where heights are converted to meters? You
can use mutate()
for this, with the argument
height_meters = height_cm/100
:
%>%
yao_height mutate(height_meters = height_cm/100)
Great. The syntax is beautifully simple, isn’t it?
Sometimes it is helpful to think of data manipulation functions in
the context of familiar spreadsheet software. Here is what the R command
mutate(height_m = height_cm/100)
would be equivalent to in
Google Sheets:
Now, imagine there was a small error in the equipment used to measure respondent heights, and all heights are 5cm too small. You therefore like to add 5cm to all heights in the dataset. To do this, rather than creating a new variable as you did before, you can modify the existing variable with mutate:
%>%
yao_height mutate(height_cm = height_cm + 5)
Again, very easy to do!
The sarcopenia
data frame has a variable
weight_kg
, which contains respondents’ weights in
kilograms. Create a new column, called weight_grams
, with
respondents’ weights in grams. Store your answer in the
Q_weight_to_g
object. (1 kg equals 1000 grams.)
# Complete the code with your answer:
<-
Q_weight_to_g %>%
sarcopenia _____________________
Hopefully you now see that the mutate function is quite
user-friendly. In theory, we could end the lesson here, because you now
know how to use mutate()
😃. But of course, the devil will
be in the details—the interesting thing is not mutate()
itself but what goes inside the mutate()
call.
The rest of the lesson will go through a few use cases for the
mutate()
verb. In the process, we’ll touch on several new
functions you have not yet encountered.
You can use mutate()
to create a Boolean variable to
categorize part of your population.
Below we create a Boolean variable, is_child
which is
either TRUE
if the subject is a child or FALSE
if the subject is an adult (first, we select just the age
variable so it’s easy to see what is being done; you will likely not
need this pre-selection for your own analyses).
%>%
yao select(age) %>%
mutate(is_child = age <= 18)
The code age <= 18
evaluates whether each age is less
than or equal to 18. Ages that match that condition (ages 18 and under)
are TRUE
and those that fail the condition are
FALSE
.
Such a variable is useful to, for example, count the number of
children in the dataset. The code below does this with the
janitor::tabyl()
function:
%>%
yao mutate(is_child = age <= 18) %>%
tabyl(is_child)
You can observe that 31.8% (0.318…) of respondents in the dataset are children.
Let’s see one more example, since the concept of Boolean variables
can be a bit confusing. The symptoms
variable reports any
respiratory symptoms experienced by the patient:
%>%
yao select(symptoms)
You could create a Boolean variable, called
has_no_symptoms
, that is set to TRUE
if the
respondent reported no symptoms:
%>%
yao select(symptoms) %>%
mutate(has_no_symptoms = symptoms == "No symptoms")
Similarly, you could create a Boolean variable called
has_any_symptoms
that is set to TRUE
if the
respondent reported any symptoms. For this, you’d simply swap the
symptoms == "No symptoms"
code for
symptoms != "No symptoms"
:
%>%
yao select(symptoms) %>%
mutate(has_any_symptoms = symptoms != "No symptoms")
Still confused by the Boolean examples? That’s normal. Pause and play with the code above a little. Then try the practice question below
Women with a grip strength below 20kg are considered to have low grip
strength. With a female subset of the sarcopenia
data
frame, add a variable called low_grip_strength
that is
TRUE
for women with a grip strength < 20 kg and FALSE
for other women.
# Complete the code with your answer:
<-
Q_women_low_grip_strength %>%
sarcopenia filter(sex_male_1_female_0 == 0) # first we filter the dataset to only women
# mutate code here
What percentage of women surveyed have a low grip strength according to the definition above? Enter your answer as a number without quotes (e.g. 43.3 or 12.2), to one decimal place.
<- YOUR_ANSWER_HERE Q_prop_women_low_grip_strength
Now, let’s look at an example of creating a numeric variable, the body mass index (BMI), which a commonly used health indicator. The formula for the body mass index can be written as:
\[
BMI = \frac{weight (kilograms)}{height (meters)^2}
\] You can use mutate()
to calculate BMI in the
yao
dataset as follows:
%>%
yao select(weight_kg, height_cm) %>%
# first obtain the height in meters
mutate(height_meters = height_cm/100) %>%
# then use the BMI formula
mutate(bmi = weight_kg / (height_meters)^2)
Let’s save the data frame with BMIs for later. We will use it in the next section.
<-
yao_bmi %>%
yao select(weight_kg, height_cm) %>%
# first obtain the height in meters
mutate(height_meters = height_cm/100) %>%
# then use the BMI formula
mutate(bmi = weight_kg / (height_meters)^2)
Appendicular muscle mass (ASM), a useful health indicator, is the sum of muscle mass in all 4 limbs. It can predicted with the following formula, called Lee’s equation:
\[ASM(kg)= (0.244 \times weight(kg)) + (7.8 \times height(m)) + (6.6 \times sex) - (0.098 \times age) - 4.5\]
The sex
variable in the formula assumes that men are
coded as 1 and women are coded as 0 (which is already the case for our
sarcopenia
dataset.) The - 4.5
at the end is a
constant used for Asians.
Calculate the ASM value for all individuals in the
sarcopenia
dataset. This value should be in a new column
called asm
# Complete the code with your answer:
<-
Q_asm_calculation #_____
sarcopenia #________________
In your data analysis workflow, you often need to redefine variable
types. You can do so with functions like
as.integer()
, as.factor()
,
as.character()
and as.Date()
within your
mutate()
call. Let’s see one example of this.
as.integer
as.integer()
converts any numeric values to
integers:
%>%
yao_bmi mutate(bmi_integer = as.integer(bmi))
Note that this truncates integers rather than rounding them
up or down, as you might expect. For example the BMI 22.8 in the third
row is truncated to 22. If you want rounded numbers, you can use the
round
function from base R
Using as.integer()
on a factor variable is a fast way of
encoding strings into numbers. It can be essential to do so for some
machine learning data processing.
%>%
yao_bmi mutate(bmi_integer = as.integer(bmi),
bmi_rounded = round(bmi))
The base R round()
function rounds “half down”. That is,
the number 3.5, for example, is rounded down to 3 by
round()
. This is weird. Most people expect 3.5 to be
rounded up to 4, not down to 3. So most of the time, you’ll
actually want to use the round_half_up()
function from
janitor.
In future lessons, you will discover how to manipulate dates and how
to convert to a date type using as.Date()
.
Use as_integer()
to convert the ages of respondents in
the sarcopenia
dataset to integers (truncating them in the
process). This should go in a new column called
age_integer
# Complete the code with your answer:
<-
Q_age_integer #_____
sarcopenia #________________
As you can imagine, transforming data is an essential step in any
data analysis workflow. It is often required to clean data and to
prepare it for further statistical analysis or for making plots. And as
you have seen, it is quite simple to transform data with dplyr’s
mutate()
function, although certain transformations are
trickier to achieve than others.
Congrats on making it through.
But your data wrangling journey isn’t over yet! In our next lessons, we will learn how to create complex data summaries and how to create and work with data frame groups. Intrigued? See you in the next lesson.
The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources:
Horst, A. (2022). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Create, modify, and delete columns — Mutate. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/mutate.html
Apply a function (or functions) across multiple columns — Across. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/across.html
Artwork was adapted from:
Other references: