Factors are an important data class for representing and working with
categorical variables in R. In this lesson, we will learn how to create
factors and how to manipulate them with functions from the
forcats package, a part of the tidyverse. Let’s dive
in!
You understand what factors are and how they differ from characters in R.
You are able to modify the order of factor levels.
You are able to modify the value of factor levels.
We will use a dataset containing information about HIV mortality in Colombia from 2010 to 2016, which is hosted on the open data platform ‘Datos Abiertos Colombia.’ You can learn more and access the full dataset here.
Each row corresponds to an individual who passed away from AIDS or AIDS-related-complications.
See the appendix at the bottom for the data dictionary describing all variables.
Factors are an important data class in R used to represent categorical variables.
A categorical variable takes on a limited set of possible values or levels. For example, country, race or political affiliation. These differ from free-form string variables that take arbitrary values, like person names, book titles or doctor’s comments.
Review of the Main Data Classes in R
Numeric: Represents continuous numerical data,
including decimal numbers.Integer: Specifically for whole numbers without decimal
places.Character: Used for text or string data.Logical: Represents boolean values (TRUE or
FALSE).Factor: Used for categorical data with predefined
levels or categories.Date: Represents dates without times.Factors have a few key advantages over character vectors for working with categorical data in R:
lm(), require
categorical variables to be input as factorsThis last point, controlling the order of factor levels, will be our primary focus.
Let’s see a practical example of the value of factors using the
hiv_mort dataset we loaded above.
Suppose you are interested in visualizing the patients in the dataset by their birth month. We can do this with ggplot:
However, there’s a hiccup: the x-axis (representing the months) is arranged alphabetically, with April first on the left, then August, and so on. But months should follow a specific chronological order!
We can arrange the plot in the desired order by creating a factor
using the factor() function:
hiv_mort_modified <-
hiv_mort %>%
mutate(birth_month = factor(x = birth_month,
levels = c("Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug",
"Sep", "Oct", "Nov", "Dec")))The syntax is straightforward: the x argument takes the
original character column, birth_month, and the
levels argument takes in the desired sequence of
months.
When we inspect the data type of the birth_month variable, we can see its transformation:
## [1] "factor"
## [1] "character"
Now we can regenerate the ggplot with the modified dataset:
The months on the x-axis are now displayed in the order we specified.
The new factor variable will respect the defined order in other
contexts as well. For example, compare how the count()
function displays the two frequency tables below:
## # A tibble: 13 × 2
## birth_month n
## <chr> <int>
## 1 Apr 33
## 2 Aug 33
## 3 Dec 39
## 4 Feb 28
## 5 Jan 35
## 6 Jul 33
## 7 Jun 42
## 8 Mar 31
## 9 May 30
## 10 Nov 31
## 11 Oct 42
## 12 Sep 33
## 13 <NA> 35
## # A tibble: 13 × 2
## birth_month n
## <fct> <int>
## 1 Jan 35
## 2 Feb 28
## 3 Mar 31
## 4 Apr 33
## 5 May 30
## 6 Jun 42
## 7 Jul 33
## 8 Aug 33
## 9 Sep 33
## 10 Oct 42
## 11 Nov 31
## 12 Dec 39
## 13 <NA> 35
::: watch-out Be mindful when creating factor levels! Any values in
the variable that are not included in the set of levels
provided to the levels argument will be converted to
NA.
For instance, if we missed some months in our example:
hiv_mort_missing_months <-
hiv_mort %>%
mutate(birth_month = factor(x = birth_month,
levels = c("Jan", "Feb", "Mar", "Apr",
# missing months
"Sep", "Oct", "Nov", "Dec")))We end up with a lot of NA values:
You will have the same problem if there are typographical errors:
hiv_mort_with_typos <-
hiv_mort %>%
mutate(birth_month = factor(x = birth_month,
levels = c("Jan", "Feb", "Mar", "Apr",
"Moy", "Jon", "Jol", "Aog", # typos
"Sep", "Oct", "Nov", "Dec")))You can use factor without levels. It just uses default (alphabetical) arrangement of levels
## [1] "factor"
## [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"
Using the hiv_mort dataset, convert the
gender variable to a factor with the levels “Female” and
“Male”, in that order.
What errors are you able to spot in the following code chunk? What are the consequences of these errors?
What is one main advantage of using factors over characters for categorical data in R?
forcatsFactors are very useful, but they can sometimes be a little tedious
to manipulate using base R functions alone. Thankfully, the
forcats package, a member of the tidyverse, offers a set of
functions that make factor manipulation much simpler. We’ll consider
four functions here, but there are many others, so we encourage you to
explore the forcats website on your own time here!
The fct_relevel() function is used to manually change
the order of factor levels.
For example, let’s say we want to visualize the frequency of individuals in our dataset by municipality type. When we create a bar plot, the values are ordered alphabetically by default:
But what if we want a specific value, say “Populated center”, to appear first in the plot?
This can be achieved using fct_relevel(). Here’s
how:
hiv_mort_pop_center_first <-
hiv_mort %>%
mutate(municipality_type = fct_relevel(municipality_type, "Populated center"))The syntax is straightforward: we pass the factor variable as the first argument, and the level we want to move to the front as the second argument.
Now when we plot:
The “Populated center” level is now first.
We can move the “Populated center” level to a different position with
the after argument:
hiv_mort %>%
mutate(municipality_type = fct_relevel(municipality_type, "Populated center",
after = 2)) %>%
# pipe directly into to plot to visualize change
ggplot() +
geom_bar(aes(x = municipality_type))The syntax is: specify the factor, the level to move, and use the
after argument to define what position to place it
after.
We can also move multiple levels at a time by providing these levels
to fct_relevel():
Below we arrange all the factor levels for municipality type in our desired order:
hiv_mort %>%
mutate(municipality_type = fct_relevel(municipality_type,
"Scattered rural",
"Populated center",
"Municipal head")) %>%
ggplot() +
geom_bar(aes(x = municipality_type))This is similar to creating a factor from scratch with levels in that order:
hiv_mort %>%
mutate(municipality = factor(municipality_type,
levels = c("Scattered rural",
"Populated center",
"Municipal head")))
ggplot() +
geom_bar(aes(x = municipality))Using the hiv_mort dataset, convert the
death_location variable to a factor such that
‘Home/address’ is the first level. Then create a bar plot that shows the
count of individuals in the dataset by death_location.
fct_reorder() is used to reorder the levels of a factor
based on the values of another variable.
To illustrate, let’s make a summary table with number of deaths, mean and median age at death for each municipality:
summary_per_muni <-
hiv_mort %>%
group_by(municipality_name) %>%
summarise(n_deceased = n(),
mean_age_death = mean(age_at_death, na.rm = T),
med_age_death = median(age_at_death, na.rm = T))
summary_per_muni ## # A tibble: 25 × 4
## municipality_name n_deceased mean_age_death med_age_death
## <chr> <int> <dbl> <dbl>
## 1 Aguadas 2 42 42
## 2 Anserma 15 37.4 37.5
## 3 Aranzazu 2 37.5 37.5
## 4 Belalcázar 4 38.8 41
## 5 Chinchiná 62 43.6 42.5
## 6 Filadelfia 5 42.6 43
## 7 La Dorada 46 41.0 41
## 8 La Merced 3 27 28
## 9 Manizales 199 41.0 41
## 10 Manzanares 3 38.3 34
## # ℹ 15 more rows
When plotting one of the variables, we may want to arrange the factor levels by that numeric variable. For example, to order municipality by the mean age column:
summary_per_muni_reordered <-
summary_per_muni %>%
mutate(municipality_name = fct_reorder(.f = municipality_name,
.x = mean_age_death))The syntax is:
.f - the factor to reorder.x - the numeric vector determining the new orderWe can now plot a nicely arranged bar chart:
Starting with the summary_per_muni data frame, reorder
the municipality (municipality_name) by the
med_age_death column and plot the reordered bar chart.
.fun argumentSometimes we want the categories in our plot to appear in a specific
order that is determined by a summary statistic. For example, consider
the box plot of birth_year by
marital_status:
The boxplot displays the median birth_year for each
category of marital status as a line in the middle of each box. We might
want to arrange the marital_status categories in order of these medians.
But if we create a summary table with medians, like we did before with
summary_per_muni, we can’t create a box plot with it (go
look at the summary_per_muni data frame to verify this
yourself).
This is where the .fun argument of
fct_reorder() comes in. The .fun argument
allows us to specify a summary function that will be used to calculate
the new order of the levels:
hiv_mort_arranged_marital <-
hiv_mort %>%
mutate(marital_status = fct_reorder(.f = marital_status,
.x = birth_year,
.fun = median,
na.rm = TRUE)) In this code, we are reordering the marital_status
factor based on the median of birth_year. We include the
argument na.rm = TRUE to ignore NA values when calculating
the median.
Now, when we create our box plot, the marital_status
categories are ordered by the median birth_year:
We can see that individuals with the marital status “cohabiting” tend to be the youngest (they were born in the latest years).
Using the hiv_mort dataset, make a boxplot of
birth_year by health_insurance_status, where
the health_insurance_status categories are arranged by the
median birth_year.
The fct_recode() function allows us to manually change
the values of factor levels. This function can be especially helpful
when you need to rename categories or when you want to merge multiple
categories into one.
For example, we can rename ‘Municipal head’ to ‘City’ in the
municipality_type variable:
hiv_mort_muni_recode <- hiv_mort %>%
mutate(municipality_type = fct_recode(municipality_type,
"City" = "Municipal head"))
# View the change
levels(hiv_mort_muni_recode$municipality_type)## [1] "City" "Populated center" "Scattered rural"
In the above code, fct_recode() takes two arguments: the
factor variable you want to change (municipality_type), and
the set of name-value pairs that define the recoding. The new level
(“City”) is on the left of the equals sign, and the old level
(“Municipal head”) is on the right.
fct_recode() is particularly useful for compressing
multiple categories into fewer levels:
We can explore this using the education_level variable. Currently it has six categories:
## # A tibble: 6 × 2
## education_level n
## <chr> <int>
## 1 No information 88
## 2 None 22
## 3 Post-secondary 29
## 4 Preschool 3
## 5 Primary 187
## 6 Secondary 116
For simplicity, let’s group them into just three categories - primary & below, secondary & above and other:
hiv_mort_educ_simple <-
hiv_mort %>%
mutate(education_level = fct_recode(education_level,
"primary & below" = "Primary",
"primary & below" = "Preschool",
"secondary & above" = "Secondary",
"secondary & above" = "Post-secondary",
"others" = "No information",
"others" = "None"))This condenses the categories nicely:
## # A tibble: 3 × 2
## education_level n
## <fct> <int>
## 1 others 110
## 2 secondary & above 145
## 3 primary & below 190
For good measure, we can arrange the levels in a reasonable order, with “others” as the last level:
hiv_mort_educ_sorted <-
hiv_mort_educ_simple %>%
mutate(education_level = fct_relevel(education_level,
"primary & below",
"secondary & above",
"others"))This condenses the categories nicely:
## # A tibble: 3 × 2
## education_level n
## <fct> <int>
## 1 primary & below 190
## 2 secondary & above 145
## 3 others 110
Using the hiv_mort dataset, convert death_location to a
factor.
Then use fct_recode() to rename ‘Public way’ in
death_location to ‘Public place’. Plot the frequency counts of the
updated variable.
fct_recode vs case_when/if_else
You might question why we need fct_recode() when we can
utilize case_when() or if_else() or even
recode() to substitute specific values. The issue is that
these other functions can disrupt your factor variable.
To illustrate, let’s say we choose to use case_when() to
make a modification to the education_level variable of the
hiv_mort_educ_sorted data frame.
As a quick reminder, that the education_level variable
is a factor with three levels, arranged in a specified order, with
“primary & below” first and “others” last:
## # A tibble: 3 × 2
## education_level n
## <fct> <int>
## 1 primary & below 190
## 2 secondary & above 145
## 3 others 110
Say we wanted to replace the “others” with “other”, removing the “s”. We can write:
hiv_mort_educ_other <-
hiv_mort_educ_sorted %>%
mutate(education_level = if_else(education_level == "others",
"other", education_level))After this operation, the variable is no longer a factor:
## [1] "character"
If we then create a table or plot, our order is disrupted and reverts to alphabetical order, with “other” as the first level:
## # A tibble: 3 × 2
## education_level n
## <chr> <int>
## 1 other 110
## 2 primary & below 190
## 3 secondary & above 145
However, if we had used fct_recode() for recoding, we
wouldn’t face this issue:
hiv_mort_educ_other_fct <-
hiv_mort_educ_simple %>%
mutate(education_level = fct_recode(education_level, "other" = "others"))The variable remains a factor:
## [1] "factor"
And if we create a table or a plot, our order is preserved: primary, secondary, then other:
## # A tibble: 3 × 2
## education_level n
## <fct> <int>
## 1 other 110
## 2 secondary & above 145
## 3 primary & below 190
Sometimes, we have too many levels for a display table or plot, and we want to lump the least frequent levels into a single category, typically called ‘Other’.
This is where the convenience function fct_lump() comes
in.
In the below example, we lump less frequent municipalities into ‘Other’, preserving just the top 5 most frequent municipalities:
hiv_mort_lump_muni <- hiv_mort %>%
mutate(municipality_name = fct_lump(municipality_name, n = 5))
ggplot(hiv_mort_lump_muni, aes(x = municipality_name)) +
geom_bar()In the usage above, the parameter n = 5 means that the
five most frequent municipalities are preserved, and the rest are lumped
into ‘Other’.
We can provide a custom name for the other category with the
other_level argument. Below we use the name
"Other municipalities".
hiv_mort_lump_muni_other_name <- hiv_mort %>%
mutate(municipality_name = fct_lump(municipality_name, n = 5,
other_level = "Other municipalities"))
ggplot(hiv_mort_lump_muni_other_name, aes(x = municipality_name)) +
geom_bar()In this way, fct_lump() is a handy tool for condensing
factors with many infrequent levels into a more manageable number of
categories.
Starting with the hiv_mort dataset, use
fct_lump() to create a bar chart with the frequency of the
10 most common occupations.
Lump the remaining occupation into an ‘Other’ category.
Put occupation on the y-axis, not the x-axis, to avoid
label overlap.
Congrats on getting to the end. In this lesson, you learned details
about the data class, factors, and how to manipulate
them using basic operations such as fct_relevel(),
fct_reorder() fct_recode(), and
fct_lump().
While these covered common tasks such as reordering, recoding, and collapsing levels, this introduction only scratches the surface of what’s possible with the forcats package. Do explore more on the forcats website.
Now that you understand the basics of working with factors, you are equipped to properly represent your categorical data in R for downstream analysis and visualization.
Errors:
Consequences:
Any rows with the values “May”, “Nov”, or “Aug” for death_month will be converted to NA in the new death_month variable. If you create plots, ggplot will drop these levels with only NA values.
The other two statements are not true.
If you want to apply string operations like substr(), strsplit(), paste(), etc., it’s actually more straightforward to use character vectors than factors.
And while many statistical functions expect factors, not characters, for categorical predictors, this does not make them more “accurate”.
# Convert death_location to a factor with 'Home/address' as the first level
hiv_mort <- hiv_mort %>%
mutate(death_location = fct_relevel(death_location, "Home/address"))
# Create a bar plot of death_location
ggplot(hiv_mort, aes(x = death_location)) +
geom_bar() +
labs(title = "Count of Individuals by Death Location")# Reorder municipality_name by med_age_death and plot
summary_per_muni <- summary_per_muni %>%
mutate(municipality_name = fct_reorder(municipality_name, med_age_death))
# Create a bar plot of reordered municipality_name
ggplot(summary_per_muni, aes(x = municipality_name)) +
geom_bar() +
labs(title = "Municipality by Median Age at Death")# Create a boxplot of birth_year by health_insurance_status
ggplot(hiv_mort, aes(x = health_insurance_status, y = birth_year)) +
geom_boxplot() +
labs(title = "Boxplot of Birth Year by Health Insurance Status") +
coord_flip() # To display categories on the y-axis# Convert death_location to a factor and rename 'Public way' to 'Public place'
hiv_mort <- hiv_mort %>%
mutate(death_location = fct_recode(death_location, "Public place" = "Public way"))
# Plot frequency counts of the updated variable
ggplot(hiv_mort, aes(x = death_location)) +
geom_bar() +
labs(title = "Frequency Counts of Death Location")# Create a bar chart with the frequency of the 10 most common occupations
hiv_mort <- hiv_mort %>%
mutate(occupation = fct_lump(occupation, n = 10, other_level = "Other"))
# Create the bar plot with occupation on the y-axis
ggplot(hiv_mort, aes(x = occupation)) +
geom_bar() +
labs(title = "Frequency of the 10 Most Common Occupations") +
coord_flip() # To display categories on the y-axisThe variables in the dataset are:
municipality: general municipal location of the
patient [chr]
death_location: location where the patient died
[chr]
birth_date: full date of birth, formatted
“YYYY-MM-DD” [date]
birth_year: year when the patient was born
[dbl]
birth_month: month when the patient was born
[chr]
birth_day: day when the patient was born
[dbl]
death_year: year when the patient died
[dbl]
death_month: month when the patient died
[chr]
death_day: day when the patient died [dbl]
gender: gender of the patient [chr]
education_level: highest level of education attained
by patient [chr]
occupation: occupation of patient [chr]
racial_id: race of the patient [chr]
municipality_code: specific municipal location of
the patient [chr]
primary_cause_death_description: primary cause of
the patient’s death [chr]
primary_cause_death_code: code of the primary cause
of death [chr]
secondary_cause_death_description: secondary cause
of the patient’s death [chr]
secondary_cause_death_code: code of the secondary
cause of death [chr]
tertiary_cause_death_description: tertiary cause of
the patient’s death [chr]
tertiary_cause_death_code: code of the tertiary
cause of death [chr]
quaternary_cause_death_description: quaternary cause
of the patient’s death [chr]
quaternary_cause_death_code: code of the quaternary
cause of death [chr]
The following team members contributed to this lesson: