Scatter plots - which are sometimes called bivariate plots - allow you to visualize the relationship between two numerical variables.
They are among the most commonly used plots because they can provide an immediate way to see how one numerical variable varies against another.
Scatter plots can also display multiple relationships by mapping additional variable to aesthetic properties, such as color of the points.
Trends and relationships in a scatter plot can be made clearer by adding a smoothing line over the points.
We will use ggplot to do all that and more. Let’s get started!
geom_point()
.color
as an aesthetic
argument to map variables from the dataset onto individual points.geom_smooth()
.We will be using data collected for a prospective observational study of acute diarrhea in children aged 0-59 months. The study was conducted in Mali and in early 2020.
The full dataset can be obtained from Dryad, and the paper can be viewed here.
A prospective study watches for outcomes, such as the development of a disease, during the study period and relates this to other factors such as suspected risk or protection factors.
Spend some time browsing through this dataset. Each row corresponds to one patient surveyed. There are demographic, physiological, clinical, socioeconomic, and geographic variables.
We will begin by visualizing the relationship between the following two numerical variables:
age_months
: the patient’s age in
months on the horizontal x-axis andviral_load
: the patient’s viral load
on the vertical y-axisgeom_point()
We will explore relationships between some numerical variables in the
malidd
data frame.
We will now examine at and run the code that will create the desired scatter plot, while keeping in mind the GG framework. Let’s take a look at the code and break it down piece-by-piece.
Remember that we specify the first two GG layers as arguments (i.e.,
inputs) within the ggplot()
function:
malidd
data frame with the
data
argument, by inputting
data = malidd
.aes
thetics
function of the mapping
argument, by
inputting
mapping = aes(x = age_months, y = viral_load)
.
Specifically, the variable age_months
is mapped to the
x
aesthetic, while the variable viral_load
is
mapped to the y
aesthetic.We then add the geom_*()
function on a
new layer with a +
sign. The geometric
objects (i.e., shapes) needed for a scatter plot are points, so we add
geom_point()
.
After running the following lines of code, you’ll produce the scatter plot below:
# Simple scatter plot of viral load vs age
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point()
This suggests that viral load generally decreases with age.
malidd
data frame, create a scatter plot
showing the relationship between age and height
(height_cm
).An aesthetic is a visual property of the geometric objects
(geom
s) in your plot. Aesthetics include things like the
size, the shape, or the color of your points. You can display a point in
different ways by changing the values of its aesthetic properties.
Remember, there are two methods for changing the aesthetic properties
of your geom
s (in this case, points).
You can convey information about your data by mapping
the variables in your dataset to aesthetics in your plot. For this
method, you use aes()
in the mapping
argument
to associate the name of the aesthetic with a variable to
display.
You can also set the aesthetic properties of your
geom
s manually. Here the aesthetic doesn’t convey
information about a variable, but only changes the appearance of the
plot. To change an aesthetic manually, you set the aesthetic by name as
an argument of your geom_*()
function; i.e. it goes
outside of aes()
.
In addition to mapping variables to the x and
y axes like with did above, variables can be mapped to
the color, shape, size, opacity, and other visual characteristics of
geom
s. This allows groups of observations to be
superimposed in a single graph.
To map a variable to an aesthetic, associate the name of the
aesthetic to the name of the variable inside aes()
. This
way, we can visualize a third variable to our simple two dimensional
scatter plot by mapping it to a new aesthetic.
For example, let’s map height_cm
to the colors of our
points, to show us how height varies with age and viral load:
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(mapping = aes(color = height_cm))
We see that {ggplot2} has automatically assigned the values of our variable to an aesthetic, a process known as scaling. {ggplot2} will also add a legend that explains which levels correspond to which values.
Here the points are colored by different shades of the same blue hue, with darker colors representing lower values.
This shows us that height increases with age, as expected.
Instead of a continuous variable like height_cm
, we can
also map a binary variable like breastfeeding
, to show us
the which children are breastfed and which ones are not:
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(mapping = aes(color = breastfeeding))
We get the same gradual color scaling like with did with height. This communicates a continuum of values, rather than the two distinct values in our variable - 0 or 1.
This is because of the data class of the breastfeeding
variable in malidd
:
## [1] "numeric"
But even though binary variables are numerical, they represent two discrete possibilities. So the continuous color scaling in the plot above is not ideal.
In cases like this, we add the function factor()
around
the breastfeeding
variable to tell ggplot()
to
treat the variable as a factor. Let’s see what happens when we do
that:
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(mapping = aes(color = factor(breastfeeding)))
When the variable is treated like a factor, the colors chosen are
clearly distinguishable. With factors, {ggplot2} will automatically
assign a unique level of the aesthetic (here a unique color) to each
unique value of the variable. (this is what happened with the
region
variable of the nigerm
dataframe that
we use in the last lesson)
This plot reveals a clear relationship between age and breastfeeding, as we might expect. Children are likely to stop breastfeeding around 20 months of age. In this study, no child at or above 25 months was being breastfed.
Adding colors to the scatter plot allowed us to visualize a third variable in addition to the relationship between age and viral load. The third variable could be either discrete or continuous.
Using the malidd
data frame, create a scatter plot
showing the relationship between age and viral load, and map a third
variable, freqrespi
, to color:
Create the same age vs. height scatterplot again, but this time,
map the binary variable fever
to the color of the points.
Keep in mind that fever
should be treated as a
factor.
Aesthetic arguments set to a fixed value will be static, and the
visual effect is not data-dependent. To add a fixed aesthetic, we add as
a direct argument of the geom_*()
function; i.e., it goes
outside of mapping = aes()
.
Let’s look at some of the aesthetic arguments we can place directly
within geom_point()
to make visual changes to the points in
our scatter plot:
color
- point color or point outline color
size
- point size
alpha
- point opacity
shape
- point shape
fill
- point fill color (only applies if the point
has an outline)
To use these options to create a more attractive scatter plot, you’ll need to pick a value for each argument that makes sense for that aesthetic, as shown in the examples below.
color
, size
and alpha
Let’s change the color of the points to a fixed value by setting the
color
argument directly within geom_point()
.
The color we choose must be a character string that R recognizes as a
color. Here we will set the point colors to steel blue:
# Modify original scatter plot by setting `color = "steelblue"`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(color = "steelblue") # set color
In addition to changing the default color, now we will modify the
size
aesthetic of the points by assigning it to a fixed
number (in millimeters). The default size is 1 mm, so let’s chose a
larger value:
# Set size to 2 mm by ading `size = 2`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(color = "steelblue", # set color
size = 2) # set size (mm)
The alpha
aesthetic controls the level of opacity of
geom
s. alpha
is also numerical, and ranges
from 0 (completely transparent) to the default of 1 (completely opaque).
Let’s make our points more transparent by reducing the opacity:
# Set opacity to 75% by adding `alpha = 0.75`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(color = "steelblue", # set color
size = 2, # set size (mm)
alpha = 0.75) # set level of opacity
Now we can see where multiple points overlap. This is a useful parameter for scatter plots where there is overplotting.
Remember, changing the color, size, or opacity of our points here is not conveying any information in the data - they are design choices we make to create prettier plots.
cornflowerblue
, increase the size of points to 3 mm and set
the opacity to 60%.shape
and fill
We can change the appearance of points in a scatter plot with the
shape
aesthetic.
To change the shape of your geom
s to a fixed value, set
shape
equal to a number corresponding to your desired
shape.
{ggplot2} will accept the following numbers:
Notice that
some of the shapes are filled in with red. This indicates that objects
21-24 are sensitive to both
color
and fill
,
but the others are only sensitive to color.
First let’s modify our original scatterplot by changing the shapes to a something that can be filled in:
# Set shape to fillable circles by adding `shape = 21`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(shape = 21) # set shapes to display
Fillable shapes can have different colors for the outline and
interior. Changing the color
aesthetic will only change the
outline of our points:
# Set outline color of the shapes by adding `color = cyan4`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(shape = 21, # set shapes to display
color = "cyan4") # set outline color
Now let’s fill in the points:
# Set interior color of the shapes by adding `fill = "seagreen"`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(shape = 21, # set shapes to display
color = "cyan4", # set outline color
fill = "seagreen") # set fill color
We can improve the readability by increasing size and reducing
opcaity with size
and alpha
, like we did
before:
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(shape = 21, # set shapes to display
color = "cyan4", # set outline color
fill = "seagreen", # set fill color
size = 2, # set size (mm)
alpha = 0.75) # set level of opacity
It can be hard to view relationships or trends with just points alone. Often we want to add a smoothing line in order to see what the trends look like. This can be especially helpful when trying to understand regressions.
To get a better idea of the relationship between these to variables, we can add a trend line (also known as a best fit line or a smoothing line).
To do this, we add the function geom_smooth()
to our
scatter plot:
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The smoothing line comes after our points an another geometric layer added onto our plot.
The default smoothing function used in this scatter plot is “loess” which stands for for locally weighted scatter plot smoothing. Loess smoothing is a process used by many statistical softwares. In {ggplot2} this generally should be done when you have less than 1000 points, otherwise it can be time consuming.
Many other smoothing functions can also be used in
geom_smooth()
.
Let’s request a linear regression method. This time we will use a
generalized linear model by setting the method
argument
inside geom_smooth()
:
# Change to a linear smoothing function with `method = "glm"`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point() +
geom_smooth(method = "glm")
## `geom_smooth()` using formula = 'y ~ x'
By default, 95% confidence limits for these lines are displayed.
You can suppress the confidence bands by including the argument
se = FALSE
inside geom_smooth()
:
# Remove confidence interval bands by adding `se = FALSE`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point() +
geom_smooth(method = "glm",
se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
In addition to changing the method, let’s add the color
argument inside geom_smooth()
to change the color of the
line.
# Change the color of the trend line by adding `color = "darkred"`
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point() +
geom_smooth(method = "glm",
se = FALSE,
color = "darkred")
## `geom_smooth()` using formula = 'y ~ x'
This linear regression concurs with what we initially observed in the
first scatter plot. A negative relationship exists between
age_months
and viral_load
: as age increases,
viral load tends to decrease.
Let’s add a third variable from the malidd
dataset
calledvomit
. This which is a binary variable that records
whether or not the patient vomited. We will add the vomit
variable to the plot by mapping it to the color aesthetic. We will again
change the smoothing method to generalized additive model
(“gam
”) and make some aesthetic modifications to the line
in the geom_smooth()
layer.
ggplot(data = malidd,
mapping = aes(x = age_months,
y = viral_load)) +
geom_point(mapping = aes(color = factor(vomit))) +
geom_smooth(method = "gam",
size = 1.5,
color = "darkgray")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2
## 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this
## warning was generated.
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
Observe the distribution of blue points (children who vomited) compared to red points (children who did not vomit). The blue points mostly occur above the trend line. This shows that higher viral loads were not only associated with younger children, but that children with higher viral loads were more likely to exhibit symptoms of vomiting.
Create a scatter plot with the age_months
and
viral_load
variables. Set the color of the points to
“steelblue”, the size to 2.5mm, the opacity to 80%. Then add trend line
with the smoothing method “lm” (linear model). To make the trend line
stand out, set its color to “indianred3”.
Recreate the plot you made in the previous question, but this
time adapt the code to change the shape of the points to tilted
rectangles (number 23), and add the body temperature variable
(temp
) by mapping it to fill color of the
points.
Scatter plots display the relationship between two numerical variables.
With medium to large datasets, you may need to play around with the different modifications to scatter plots we saw such as adding trend lines, changing the color, size, shape, fill, or opacity of the points. This tweaking is often a fun part of data visualization, since you’ll have the chance to see different relationships emerge as you tinker with your plots.
The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources:
This work is licensed under the Creative Commons Attribution Share Alike license.