Data wrangling often involves applying the same operations separately
to different groups within the data. This pattern, sometimes called
“split-apply-combine”, is easily accomplished in {dplyr} by chaining the
group_by() verb with other wrangling verbs like
filter(), mutate(), and arrange()
(all of which you have seen before!).
In this lesson, you’ll become confident with these kinds of grouped manipulations.
Let’s get started.
group_by() with arrange(),
filter(), and mutate() to conduct grouped
operations on a data frame.This lesson will require the {tidyverse} suite of packages and the {here} package:
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, here)In this lesson, we will again use data from the COVID-19 serological
survey conducted in Yaounde, Cameroon. Below, we import the data, create
a small data frame subset, yao and an even smaller subset,
yao_sex_weight.
yao <-
read_csv(here::here('data/yaounde_data.csv')) %>%
select(sex, age, age_category, weight_kg, occupation, igg_result, igm_result)
yaoyao_sex_weight <-
yao %>%
select(sex, weight_kg)
yao_sex_weightFor practice questions, we will also use the sarcopenia data set that you have seen previously:
sarcopenia <- read_csv(here::here('data/sarcopenia_elderly.csv'))
sarcopeniaThe arrange() function orders the rows of a data frame
by the values of selected columns. This function is only sensitive to
groupings when we set its argument .by_group to
TRUE. To illustrate this, consider the
yao_sex_weight data frame:
yao_sex_weightWe can arrange this data frame by weight like so:
yao_sex_weight %>%
arrange(weight_kg)As expected, lower weights have been brought to the top of the data frame.
If we first group the data, we might expect a different output:
yao_sex_weight %>%
group_by(sex) %>%
arrange(weight_kg)But as you see, the arrangement is still the same.
Only when we set the .by_group argument to
TRUE do we get something different:
yao_sex_weight %>%
group_by(sex) %>%
arrange(weight_kg, .by_group = TRUE)Now, the data is first sorted by sex (all women first), and then by weight.
arrange() can group
automaticallyIn reality we do not need group_by() to arrange by
group; we can simply put multiple variables in the
arrange() function for the same effect.
So this simple arrange() statement:
yao_sex_weight %>%
arrange(sex, weight_kg)is equivalent to the more complex group_by(),
arrange() statement used before:
yao_sex_weight %>%
group_by(sex) %>%
arrange(weight_kg, .by_group = TRUE)The code arrange(sex, weight_kg) tells R to arrange the
rows first by sex, and then by weight.
Obviously, this syntax, with just arrange(), and no
group_by() is simpler, so you can stick to it.
desc() for descending order
Recall that to arrange in descending order, we can wrap the
target variable in desc(). So, for example, to sort by sex
and weight, but with the heaviest people on top, we can run:
yao_sex_weight %>%
arrange(sex, desc(weight_kg))With an arrange() call, sort the sarcopenia
data first by sex and then by grip strength. (If done correctly, the
first row should be of a woman with a grip strength of 1.3 kg). To make
the arrangement clear, you should first select() the sex
and grip strength variables.
# Complete the code with your answer:
Q_grip_strength_arranged <-
sarcopenia %>%
select(______________________________) %>%
arrange(______________________________)The sarcopenia dataset contains a column,
age_group, which stores age groups as a string (the age
groups are “Sixties”, “Seventies” and “Eighties”). Convert this variable
to a factor with the levels in the right order (first “Sixties” then
“Seventies” and so on). (Hint: Look back on the case_when()
lesson if you do not see how to relevel a factor.)
Then, with a nested arrange() call, arrange the data
first by the newly-created age_group factor variable
(younger individuals first) and then by height_meters, with
shorter individuals first.
# Complete the code with your answer:
Q_age_group_height <-
sarcopenia The filter() function keeps or drops rows based on a
condition. If filter() is applied to grouped data, the
filtering operation is carried out separately for each group.
To illustrate this, consider again the yao_sex_weight
data frame:
yao_sex_weightIf we want to filter the data for the heaviest person, we could run:
yao_sex_weight %>%
filter(weight_kg == max(weight_kg))But if we want to get heaviest person per sex group (the heaviest man
and the heaviest woman), we can use group_by(sex)
then filter():
yao_sex_weight %>%
group_by(sex) %>%
filter(weight_kg == max(weight_kg))Great! The code above can be translated as “For each sex group, keep
the row with the maximum weight_kg value”.
filter() will work fine with any number of nested
groupings.
For example, if we want to see the heaviest man and heaviest woman
per age group we could run the following on the
yao data frame:
yao %>%
group_by(sex, age_category) %>%
filter(weight_kg == max(weight_kg))This code groups by sex and age category, and then finds the heaviest person in each sub-category.
(Why do we have 10 rows in the output? Well, 2 sex groups x 5 groups age groups = 10 unique groupings.)
The output is a bit scattered though, so we can chain this with the
arrange() function, to arrange by sex and age group.
yao %>%
group_by(sex, age_category) %>%
filter(weight_kg == max(weight_kg)) %>%
arrange(sex, age_category)Now the data is easier to read. All women come first, then men. But
we see notice a weird arrangement of the age groups! Those aged 5 to 14
should come first in the arrangement. Of course, we’ve learned
how to fix this—the factor() function, and its
levels argument:
yao %>%
mutate(age_category = factor(
age_category,
levels = c("5 - 14", "15 - 29", "30 - 44", "45 - 64", "65 +")
)) %>%
group_by(sex, age_category) %>%
filter(weight_kg == max(weight_kg)) %>%
arrange(sex, age_category)Now we have a nice and well-arranged output!
Group the sarcopenia data frame by age group and sex,
then filter for the highest skeletal muscle index in each (nested)
group.
# Complete the code with your answer:
Q_max_skeletal_muscle_index <-
sarcopenia mutate() is used to modify columns or to create new
ones. With grouped data, mutate() operates over each group
independently.
Let’s first consider a regular mutate() call, not a
grouped one. Imagine that you wanted to add a column that ranks
respondents by weight. This can be done with the rank()
function inside a mutate() call:
yao_sex_weight %>%
mutate(weight_rank = rank(weight_kg))The output shows that the first row is the 901st lightest individual.
But it would be more intuitive to rank in descending order with the
heaviest person first. We can do this with the desc()
function:
yao_sex_weight %>%
mutate(weight_rank = rank(desc(weight_kg)))The output shows that the person in the first row is the 71st heaviest individual.
Now, let’s try to write a grouped mutate() call. Imagine
we want to add this weight rank column per sex group in the
data frame. That is, we want to know each person’s weight rank in their
sex category. In this case, we can chain group_by(sex) with
mutate():
yao_sex_weight %>%
group_by(sex) %>%
mutate(weight_rank = rank(desc(weight_kg)))Now we see that the person in the first row is the 53rd heaviest woman. (The .5 indicates that this rank is a tie with someone else in the data.)
We could also arrange the data to make things clearer:
yao_sex_weight %>%
group_by(sex) %>%
mutate(weight_rank = rank(desc(weight_kg))) %>%
arrange(sex, weight_rank)Of course, as with the other verbs we have seen,
mutate() also works with nested groups.
For example, below we create the nested grouping of age and
sex with the yao data frame, then add a rank column with
mutate():
yao %>%
group_by(sex, age_category) %>%
mutate(weight_rank = rank(desc(weight_kg)))The output shows that the person in the first row is 20th heaviest woman in the 45 to 64 age group.
With the sarcopenia data, group by
age_group, then in a new variable called
grip_strength_rank, compute the per-age-group rank of each
individual’s grip strength. (To compute the rank, use
mutate() and the rank() function with its
default ties method.)
# Complete the code with your answer:
Q_rank_grip_strength <-
sarcopeniaRemember to ungroup data before further analysis
As has been mentioned before, it is important ungroup your data before doing further analysis.
Consider this last example, where we computed the weight rank of individuals per age and sex group:
yao %>%
group_by(sex, age_category) %>%
mutate(weight_rank = rank(desc(weight_kg)))If, in the process of analysis, you stored this output as a new data frame:
yao_modified <-
yao %>%
group_by(sex, age_category) %>%
mutate(weight_rank = rank(desc(weight_kg)))And then, later on, you picked up the data frame and tried some other analysis, for example, filtering to get the oldest person in the data:
yao_modified %>%
filter(age == max(age))You might be confused by the output! Why are there 55 rows of “oldest people”?
This would be because you forgot to ungroup the data before storing it for further analysis. Let’s do this properly now
yao_modified <-
yao %>%
group_by(sex, age_category) %>%
mutate(weight_rank = rank(desc(weight_kg))) %>%
ungroup()Now we can correctly obtain the oldest person/people in the data set:
yao_modified %>%
filter(age == max(age))group_by() is a marvelous tool for arranging, mutating,
filtering based on the groups within a single or multiple variables.
Fig: filter() and group_by()
There are numerous ways of combining these verbs to manipulate your data. We invite you to take some time and to try these verbs out in different combinations!
See you next time!
The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources:
Horst, A. (2022). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Group by one or more variables. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/group_by.html
Create, modify, and delete columns — Mutate. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/mutate.html
Subset rows using column values — Filter. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/filter.html
Arrange rows by column values — Arrange. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/arrange.html
Artwork was adapted from: