By the end of this lesson, you will be able to:
geom_histogram()
.bins
or
binwidth
arguments.boundary
argument.breaks
argument.A histogram is a plot that visualizes the distribution of a numerical value as follows:
We first cut up the x-axis into a series of bins, where each bin represents a range of values.
For each bin, we count the number of observations that fall in the range corresponding to that bin.
Then for each bin, we draw a bar whose height marks the corresponding count.
We will visualize distributions of numerical variables in the
malidd
data frame, which we’ve seen in previous
lessons.
These data were collected as part of an observational study of acute diarrhea in children aged 0-59 months. The study was conducted in Mali and in early 2020. The dataset records demographic and clinical information for 150 patients.
The dataframe has 21 variables, many of which are continuous, like
height_cm
, viral_load
, and
freqrespi
.
geom_histogram()
Now let’s use {ggplot2} to plot the distribution of childrens’
heights, which is recorded in the heigh_cm
column of
malidd
.
The geom_*()
function used for histograms is
geom_histogram()
# Simple histogram showing the distribution of height_cm
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
If we don’t adjust the bins in geom_histogram()
, we get
a warning message. You can ignore this warning message for now, and will
learn how to customize bins in the next section.
In the previous histogram, it’s hard to where the boundaries for each bin start and end since everything is one big amorphous blob. So let’s add borders around the bins:
# Set border color to "white"
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
We now have an easier time associating ranges of cases to each of the bins.
We can also vary the color of the bars by setting the
fill
argument:
# Set fill color to "steelblue"
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue")
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
Now that we can see the bars more clearly, let’s unpack the resulting histogram. Some questions we might want to answer are:
We can see that heights range from 50 to 105cm. The center is around 70cm, most patients fall in the 60-80cm range, with very few below 55cm or above 90cm. Observe that the histogram has a bell shape, meaning that the variable has a normal distribution (more or less).
Plot a histogram showing the distribution of age
(age_months
) in malidd
. Make the borders and
fill of the bars “seagreen”, and reduce opacity to 40%.
Building on your code for the previous plot, modify the axis titles to “Age (months)” and “Number of children”, respectively.
After running code in previous examples, we got a histogram as well
as a warning message about bins and bin width. The warning message is
telling us that the histogram was constructed using
bins = 30
for 30 equally spaced bins.
# Warning message tells us to change the default of 30 bins
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue")
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
Unless you override this default number of bins with a number you specify, R will keep giving this message.
We can change the number of bins to another value using one of these
three arguments to geom_histogram()
:
Set the number of bins with
bins
Set the width of the bins with
binwidth
Set bin boundaries breaks
bins
Using the first method, we have the power to specify how many bins we
would like to cut the x-axis up in by setting
bins = INTEGER
:
# Try different numbers of bins
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(bins = 5,
color = "white",
fill = "steelblue")
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(bins = 20,
color = "white",
fill = "steelblue")
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(bins = 50,
color = "white",
fill = "steelblue")
Make a histogram of frequency of respiration
(freqrespi
), which is measured in breaths per minute. Set
the interior color to “indianred3”, and border color to “lightgray”.
Notice that with the default of 30 bins, there are some intervals for which no bar is plotted (i.e., there were no observations in that range).
Low the number of bins until there are no empty intervals. You should choose the highest value of bins for which there are no empty spaces.
binwidth
Using the second method, instead of specifying the number of bins, we
specify the width of the bins by using the binwidth
argument in geom_histogram()
.
# Try different bin widths
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue",
binwidth = 3)
Looking at the range of the variable can help us choose an appropriate bin width.
## [1] 50.3 103.4
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue",
binwidth = 5)
We can use the boundary
argument to align the bins to
the x-axis intervals.
# Set `boundary` equal to the low end of the variable
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue",
binwidth = 5,
boundary = 50)
Create the same freqrespi
histogram from the last
practice question, but this time set the bin width to to a value that
results in 18 bins. Then align the bars to the x axis breaks by
adjusting the bin boundaries.
breaks
Set breaks
equal to a numeric vector in
geom_histogram()
:
# Supply a vector that covers the range of values in height_cm
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue",
breaks = seq(50, 105, 5))
Plot the freqrespi
histogram with bin breaks that range
from the lowest value of freqrespi
to the highest, with
intervals of 4.
Next, adjust the x-axis scale breaks by adding a
scale_*()
function. Set the range to 24-60, with an
intervals of 8.
Histograms, unlike scatterplots and linegraphs, present information on only a single numerical variable. Specifically, they are visualizations of the distribution of the numerical variable in question.
The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources:
This work is licensed under the Creative Commons Attribution Share Alike license.