1 Intro

You now know how to keep or drop columns and rows from your dataset. Today you will learn how to modify existing variables or create new ones, using the mutate() verb from {dplyr}. This is an essential step in most data analysis projects.

Let’s go!

Fig: the mutate() verb.

2 Learning objectives

  1. You can use the mutate() function from the {dplyr} package to create new variables or modify existing variables.

  2. You can create new numeric, character, factor, and boolean variables

3 Packages

This lesson will require the packages loaded below:

if(!require(pacman)) install.packages("pacman")
pacman::p_load(here, 
               janitor,
               tidyverse)

4 Datasets

In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon. Below, we import the dataset yaounde and create a smaller subset called yao. Note that this dataset is slightly different from the one used in the previous lesson.

yaounde <- read_csv(here::here('data/yaounde_data.csv'))

## a smaller subset of variables
yao <- yaounde %>% select(date_surveyed, 
                          age, 
                          weight_kg, height_cm, 
                          symptoms, is_smoker)

yao

We will also use a dataset from a cross-sectional study that aimed to determine the prevalence of sarcopenia in the elderly population (>60 years) in in Karnataka, India. Sarcopenia is a condition that is common in elderly people and is characterized by progressive and generalized loss of skeletal muscle mass and strength. The data was obtained from Zenodo here, and the source publication can be found here.

Below, we import and view this dataset:

sarcopenia <- read_csv(here::here('data/sarcopenia_elderly.csv'))

sarcopenia

5 Introducing mutate()

The mutate() function. (Drawing adapted from Allison Horst)

We use dplyr::mutate() to create new variables or modify existing variables. The syntax is quite intuitive, and generally looks like df %>% mutate(new_column_name = what_it_contains).

Let’s see a quick example.

The yaounde dataset currently contains a column called height_cm, which shows the height, in centimeters, of survey respondents. Let’s create a data frame, yao_height, with just this column, for easy illustration:

yao_height <- yaounde %>% select(height_cm)
yao_height

What if you wanted to create a new variable, called height_meters where heights are converted to meters? You can use mutate() for this, with the argument height_meters = height_cm/100:

yao_height %>% 
  mutate(height_meters = height_cm/100)

Great. The syntax is beautifully simple, isn’t it?

Sometimes it is helpful to think of data manipulation functions in the context of familiar spreadsheet software. Here is what the R command mutate(height_m = height_cm/100) would be equivalent to in Google Sheets:

Now, imagine there was a small error in the equipment used to measure respondent heights, and all heights are 5cm too small. You therefore like to add 5cm to all heights in the dataset. To do this, rather than creating a new variable as you did before, you can modify the existing variable with mutate:

yao_height %>% 
  mutate(height_cm = height_cm + 5)

Again, very easy to do!

The sarcopenia data frame has a variable weight_kg, which contains respondents’ weights in kilograms. Create a new column, called weight_grams, with respondents’ weights in grams. Store your answer in the Q_weight_to_g object. (1 kg equals 1000 grams.)

# Complete the code with your answer:
Q_weight_to_g <- 
  sarcopenia %>% 
  _____________________

Hopefully you now see that the mutate function is quite user-friendly. In theory, we could end the lesson here, because you now know how to use mutate() 😃. But of course, the devil will be in the details—the interesting thing is not mutate() itself but what goes inside the mutate() call.

The rest of the lesson will go through a few use cases for the mutate() verb. In the process, we’ll touch on several new functions you have not yet encountered.

6 Creating a Boolean variable

You can use mutate() to create a Boolean variable to categorize part of your population.

Below we create a Boolean variable, is_child which is either TRUE if the subject is a child or FALSE if the subject is an adult (first, we select just the age variable so it’s easy to see what is being done; you will likely not need this pre-selection for your own analyses).

yao %>%
  select(age) %>% 
  mutate(is_child = age <= 18)

The code age <= 18 evaluates whether each age is less than or equal to 18. Ages that match that condition (ages 18 and under) are TRUE and those that fail the condition are FALSE.

Such a variable is useful to, for example, count the number of children in the dataset. The code below does this with the janitor::tabyl() function:

yao %>%
  mutate(is_child = age <= 18) %>% 
  tabyl(is_child)

You can observe that 31.8% (0.318…) of respondents in the dataset are children.

Let’s see one more example, since the concept of Boolean variables can be a bit confusing. The symptoms variable reports any respiratory symptoms experienced by the patient:

yao %>% 
  select(symptoms)

You could create a Boolean variable, called has_no_symptoms, that is set to TRUE if the respondent reported no symptoms:

yao %>% 
  select(symptoms) %>% 
  mutate(has_no_symptoms = symptoms == "No symptoms")

Similarly, you could create a Boolean variable called has_any_symptoms that is set to TRUE if the respondent reported any symptoms. For this, you’d simply swap the symptoms == "No symptoms" code for symptoms != "No symptoms":

yao %>% 
  select(symptoms) %>% 
  mutate(has_any_symptoms = symptoms != "No symptoms")

Still confused by the Boolean examples? That’s normal. Pause and play with the code above a little. Then try the practice question below

Women with a grip strength below 20kg are considered to have low grip strength. With a female subset of the sarcopenia data frame, add a variable called low_grip_strength that is TRUE for women with a grip strength < 20 kg and FALSE for other women.

# Complete the code with your answer:
Q_women_low_grip_strength <- 
  sarcopenia %>% 
  filter(sex_male_1_female_0 == 0) # first we filter the dataset to only women
  # mutate code here

What percentage of women surveyed have a low grip strength according to the definition above? Enter your answer as a number without quotes (e.g. 43.3 or 12.2), to one decimal place.

Q_prop_women_low_grip_strength <- YOUR_ANSWER_HERE

7 Creating a numeric variable based on a formula

Now, let’s look at an example of creating a numeric variable, the body mass index (BMI), which a commonly used health indicator. The formula for the body mass index can be written as:

\[ BMI = \frac{weight (kilograms)}{height (meters)^2} \] You can use mutate() to calculate BMI in the yao dataset as follows:

yao %>%
  select(weight_kg,  height_cm) %>% 
  
  # first obtain the height in meters  
  mutate(height_meters = height_cm/100) %>% 
  
  # then use the BMI formula
  mutate(bmi = weight_kg / (height_meters)^2)

Let’s save the data frame with BMIs for later. We will use it in the next section.

yao_bmi <- 
  yao %>%
  select(weight_kg,  height_cm) %>% 
  # first obtain the height in meters  
  mutate(height_meters = height_cm/100) %>% 
  # then use the BMI formula
  mutate(bmi = weight_kg / (height_meters)^2)

Appendicular muscle mass (ASM), a useful health indicator, is the sum of muscle mass in all 4 limbs. It can predicted with the following formula, called Lee’s equation:

\[ASM(kg)= (0.244 \times weight(kg)) + (7.8 \times height(m)) + (6.6 \times sex) - (0.098 \times age) - 4.5\]

The sex variable in the formula assumes that men are coded as 1 and women are coded as 0 (which is already the case for our sarcopenia dataset.) The - 4.5 at the end is a constant used for Asians.

Calculate the ASM value for all individuals in the sarcopenia dataset. This value should be in a new column called asm

# Complete the code with your answer:
Q_asm_calculation <- 
  sarcopenia #_____
  #________________

8 Changing a variable’s type

In your data analysis workflow, you often need to redefine variable types. You can do so with functions like as.integer(), as.factor(), as.character() and as.Date() within your mutate() call. Let’s see one example of this.

8.1 Integer: as.integer

as.integer() converts any numeric values to integers:

yao_bmi %>% 
  mutate(bmi_integer = as.integer(bmi))

Note that this truncates integers rather than rounding them up or down, as you might expect. For example the BMI 22.8 in the third row is truncated to 22. If you want rounded numbers, you can use the round function from base R

Using as.integer() on a factor variable is a fast way of encoding strings into numbers. It can be essential to do so for some machine learning data processing.

yao_bmi %>% 
  mutate(bmi_integer = as.integer(bmi), 
         bmi_rounded = round(bmi)) 

The base R round() function rounds “half down”. That is, the number 3.5, for example, is rounded down to 3 by round(). This is weird. Most people expect 3.5 to be rounded up to 4, not down to 3. So most of the time, you’ll actually want to use the round_half_up() function from janitor.

In future lessons, you will discover how to manipulate dates and how to convert to a date type using as.Date().

Use as_integer() to convert the ages of respondents in the sarcopenia dataset to integers (truncating them in the process). This should go in a new column called age_integer

# Complete the code with your answer:
Q_age_integer <- 
  sarcopenia #_____
  #________________

9 Wrap up

As you can imagine, transforming data is an essential step in any data analysis workflow. It is often required to clean data and to prepare it for further statistical analysis or for making plots. And as you have seen, it is quite simple to transform data with dplyr’s mutate() function, although certain transformations are trickier to achieve than others.

Congrats on making it through.

But your data wrangling journey isn’t over yet! In our next lessons, we will learn how to create complex data summaries and how to create and work with data frame groups. Intrigued? See you in the next lesson.

Fig: Basic Data Wrangling with select(), filter(), and mutate().

Contributors

The following team members contributed to this lesson:

References

Some material in this lesson was adapted from the following sources:

Artwork was adapted from:

Other references:

  • Lee, Robert C, ZiMian Wang, Moonseong Heo, Robert Ross, Ian Janssen, and Steven B Heymsfield. “Total-Body Skeletal Muscle Mass: Development and Cross-Validation of Anthropometric Prediction Models.” The American Journal of Clinical Nutrition 72, no. 3 (2000): 796–803. https://doi.org/10.1093/ajcn/72.3.796.