ggplot2

Visualization is key to telling the data’s story, and it can take a lot of work to get things to look just right. But it can also be a lot of fun, so let’s dive in!

When it comes to visualization, the most popular package used in R is ggplot2. It’s so popular that it, or at least its aesthetic, has been copied in other languages and programs. It entails a grammar of graphics (hence the gg), and learning that grammar is key to using it effectively. Some of the strengths of ggplot2 include:

The ease of getting a good looking plot
Easy customization
A lot of necessary data processing is done for you
Clear syntax
Easy multidimensional approach
Decent default color scheme
Lots of extensions

Every graph is built from the same few parts, and it’s important to be aware of a few key ideas, which we will cover in turn.

Layers (and geoms)
Piping
Aesthetics
Facets
Scales
Themes
Extensions

Note that while you can obviously use base R for visualization, it’s not going to be as easy or as flexible as ggplot2.

Layers

In general, we start with a base layer and add to it. In most cases you’ll start as follows.

# recall that starwars is in the dplyr package
ggplot(aes(x = height, y = mass), data = starwars)

The code above would just produce a plot background, but nothing else. However, with the foundation in place, we’re now ready to add something to it. Let’s add some points (the outlier is Jabba the Hutt).

ggplot(aes(x = height, y = mass), data = starwars) +
  geom_point()

Perhaps we want to change labels or theme. These would be additional layers to the plot.

ggplot(aes(x = height, y = mass), data = starwars) +
  geom_point(color = 'white') +
  labs(x = 'Height in cm', y = 'Weight in kg') +
  theme_dark()

Layers are added one after another by means of a pipe operator, and may concern geoms, scales, labels, facets, etc. You can combine as many layers as you like to produce one plot. However, some efficiencies may be possible for a given situation. For example, it’s more straightforward to use geom_smooth than to calculate fits, standard errors, etc. and then add multiple geoms to produce the same thing. This is the sort of thing you’ll get used to as you use ggplot2 more.
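To make the geom_smooth point concrete, here is a rough sketch of both routes for a simple linear fit. The mtcars data, the object names, and the 1.96 multiplier for an approximate 95% band are my own choices for illustration, not something prescribed by ggplot2.

```r
library(ggplot2)

# Manual route: fit the model, compute fitted values and standard errors,
# then draw the band and the line with separate geoms.
fit  = lm(mpg ~ wt, data = mtcars)
pred = predict(fit, se.fit = TRUE)

manual_data = data.frame(
  wt    = mtcars$wt,
  mpg   = mtcars$mpg,
  fit   = pred$fit,
  lower = pred$fit - 1.96 * pred$se.fit,  # approximate 95% band
  upper = pred$fit + 1.96 * pred$se.fit
)

p_manual = ggplot(manual_data, aes(x = wt)) +
  geom_point(aes(y = mpg)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = .2) +
  geom_line(aes(y = fit))

# Letting ggplot do the work: one layer replaces the model code and two geoms.
p_smooth = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x)
```

The two plots should look very similar, though the bands will not match exactly since the manual version uses a normal approximation.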

Piping

As we saw, layers are added via piping (+). The first layers added after the base are typically geoms, or geometric objects that represent the data, and include things like:

points
lines
density
text

In case you’re wondering why ggplot doesn’t use %>% as in the tidyverse and other visualization packages, it’s because ggplot2 was using pipes before it was cool, well before those came along. Otherwise, the concept is the same as we saw in the data processing section.

ggplot(aes(x = myvar, y = myvar2), data = mydata) +
  geom_point()

Our base is provided via the ggplot function, and specifies the data at the very least, but commonly also the x and y aesthetics.

The geom_point function adds a layer of points, and now we would have a scatterplot. Alternatively, you could have specified the x and y aesthetic at the geom_point layer, but if you’re going to have the same x, y, color, etc. aesthetics regardless of layer, put it in the base. Otherwise, doing it by layer gives you more flexibility if needed. Geoms even have their own data argument, allowing you to combine information from several sources for a single visualization.
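As a small sketch of layer-specific aesthetics and data, the following keeps x and y in the base, maps color in only one layer, and gives a second layer its own data. The cyl_means data frame is made up here just to have a second source.

```r
library(ggplot2)

# Illustrative second data source: mean weight and mpg per cylinder group.
cyl_means = aggregate(cbind(wt, mpg) ~ cyl, data = mtcars, FUN = mean)

p = ggplot(mtcars, aes(x = wt, y = mpg)) +           # shared x and y in the base
  geom_point(aes(color = factor(cyl))) +             # color mapped for this layer only
  geom_point(data = cyl_means, size = 5, shape = 4)  # same x and y, different data

p
```

Because cyl_means has columns named wt and mpg, the second layer can inherit the base mapping as is.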

Aesthetics

Aesthetics map data to various visual aspects of the plot, including size, color etc. The function used in ggplot to do this is aes.

aes(
  x = myvar,
  y = myvar2,
  color = myvar3,
  group = g
)

The best way to decide what goes into the aes function is to ask whether the value varies with the data. For example, if I want the size of points to be a fixed value, I would code the following.

... +
  geom_point(..., size = 4)

However, if I want the size to be associated with the data in some way, I use it as an aesthetic.

... +
  geom_point(aes(size = myvar))

The same goes for practically any aspect of a geom: size, color, fill, etc. If it is a fixed value, set it outside the aesthetic. If it varies based on the data, put it within an aesthetic.
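Here is a minimal runnable contrast of the two cases using mtcars (my choice of data), along with a common slip where a constant ends up inside aes.

```r
library(ggplot2)

base = ggplot(mtcars, aes(x = wt, y = mpg))

p_fixed  = base + geom_point(color = 'blue', size = 4)    # constants: outside aes()
p_mapped = base + geom_point(aes(color = hp, size = hp))  # data-driven: inside aes()

# A common slip: a constant inside aes() is treated as data, so the points
# get a default palette color and a legend entry labeled 'blue'.
p_slip = base + geom_point(aes(color = 'blue'))
```

If a plot shows an unexpected legend with a single entry, a constant inside aes is a likely culprit.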

Geoms

In the ggplot2 world, geoms are the geometric objects: shapes, lines, and other parts of the visualization we want to display. Even if you use ggplot2 a lot, you probably don’t know about many of these.

geom_abline: Reference lines: horizontal, vertical, and diagonal
geom_area: Ribbons and area plots
geom_bar: Bar charts
geom_bin2d: Heatmap of 2d bin counts
geom_blank: Draw nothing
geom_boxplot: A box and whiskers plot (in the style of Tukey)
geom_col: Bar charts
geom_contour: 2d contours of a 3d surface
geom_count: Count overlapping points
geom_crossbar: Vertical intervals: lines, crossbars & errorbars
geom_curve: Line segments and curves
geom_density: Smoothed density estimates
geom_density_2d: Contours of a 2d density estimate
geom_dotplot: Dot plot
geom_errorbar: Vertical intervals: lines, crossbars & errorbars
geom_errorbarh: Horizontal error bars
geom_freqpoly: Histograms and frequency polygons
geom_hex: Hexagonal heatmap of 2d bin counts
geom_histogram: Histograms and frequency polygons
geom_hline: Reference lines: horizontal, vertical, and diagonal
geom_jitter: Jittered points
geom_label: Text
geom_line: Connect observations
geom_linerange: Vertical intervals: lines, crossbars & errorbars
geom_map: Polygons from a reference map
geom_path: Connect observations
geom_point: Points
geom_pointrange: Vertical intervals: lines, crossbars & errorbars
geom_polygon: Polygons
geom_qq: A quantile-quantile plot
geom_qq_line: A quantile-quantile plot
geom_quantile: Quantile regression
geom_raster: Rectangles
geom_rect: Rectangles
geom_ribbon: Ribbons and area plots
geom_rug: Rug plots in the margins
geom_segment: Line segments and curves
geom_sf: Visualise sf objects
geom_sf_label: Visualise sf objects
geom_sf_text: Visualise sf objects
geom_smooth: Smoothed conditional means
geom_spoke: Line segments parameterised by location, direction and distance
geom_step: Connect observations
geom_text: Text
geom_tile: Rectangles
geom_violin: Violin plot
geom_vline: Reference lines: horizontal, vertical, and diagonal

Examples

Let’s get more of a feel for things by seeing some examples that demonstrate some geoms and aesthetics.

To begin, after setting the base aesthetic, we’ll set some explicit values for the geom.

library(ggplot2)
library(dplyr)   # for filter(), sample_frac(), and %>% used below

data("diamonds")

data('economics')

ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point(size = .5, color = 'peru')

Next we use two different geoms, and one is even using a different data source. Note that geoms have arguments both common and specific to them. In the following, label is used for geom_text, but it would be ignored by geom_line.

ggplot(aes(x = date, y = unemploy), data = economics) +
  geom_line() +
  geom_text(
    aes(label = unemploy),
    vjust = -.5,
    data = filter(economics, date == '2009-10-01')
  )

In the following, one setting, alpha (transparency), is not mapped to the data, while size and color are⁴⁵.

ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point(aes(size = carat, color = clarity), alpha = .05)
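With alpha that low, the legend keys come out just as faint as the points. One way to handle it, sketched here with the same plot, is to override the alpha used in the legends via guides.

```r
library(ggplot2)

p = ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point(aes(size = carat, color = clarity), alpha = .05) +
  guides(
    colour = guide_legend(override.aes = list(alpha = 1)),  # opaque color keys
    size   = guide_legend(override.aes = list(alpha = 1))   # opaque size keys
  )

p
```

The points keep their transparency; only the legend keys are drawn fully opaque.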

There are some other options to play with as well.

ggplot(aes(x = carat, y = price), data = diamonds %>% sample_frac(.01)) +
  geom_point(aes(size = carat, color = clarity), key_glyph = "vpath")

Stats

There are many statistical functions built in, and it is a key strength of ggplot that you don’t have to do a lot of processing for very common plots.

Here are some quantile regression lines:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_quantile()

The following shows loess (or additive model) smooths. We can do some fine-tuning and use model-based approaches for visualization.

data(mcycle, package = 'MASS')

ggplot(aes(x = times, y = accel), data = mcycle) +
  geom_point() +
  geom_smooth(formula = y ~ s(x, bs = 'ad'), method = 'gam')

Bootstrapped confidence intervals:

ggplot(mtcars, aes(cyl, mpg)) +
  geom_point() +
  stat_summary(
    fun.data = "mean_cl_boot",
    colour = "orange",
    alpha = .75,
    size = 1
  )

The take-home message here is to always let ggplot do the work for you if at all possible. However, I will say that I find it easier to create the summary data I want to visualize with tidyverse tools, rather than use stat_summary, and you may have a similar experience.
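As a sketch of that alternative, the following computes group means and a rough interval with dplyr and then hands the result to geom_pointrange. The mean plus or minus two standard errors is my stand-in for illustration, not the bootstrap interval used above.

```r
library(ggplot2)
library(dplyr)

mpg_summary = mtcars %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),
    se       = sd(mpg) / sqrt(n()),
    lower    = mean_mpg - 2 * se,   # rough interval, not a bootstrap
    upper    = mean_mpg + 2 * se
  )

p = ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point(alpha = .5) +
  geom_pointrange(
    aes(y = mean_mpg, ymin = lower, ymax = upper),
    data   = mpg_summary,
    colour = 'orange'
  )

p
```

A side benefit is that mpg_summary is an ordinary data frame you can inspect or reuse in a table.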

Scales

Often there are many things we want to change about the plot, for example, the size and values of axis labels, the range of sizes for points to take, the specific colors we want to use, and so forth. Be aware that there are a great many options here, and you will regularly want to use them.

A very common thing you’ll do is change the labels for the axes. You definitely don’t have to go and change the variable name itself to do this, just use the labs function. There are also functions for individual parts, e.g. xlab, ylab and ggtitle.

ggplot(aes(x = times, y = accel), data = mcycle) +
  geom_smooth(se = FALSE) +
  labs(
    x     = 'milliseconds after impact', 
    y     = 'head acceleration', 
    title = 'Motorcycle Accident'
  )

A frequent operation is changing the look of the x and y axes in the form of limits and tick marks. Like labs, there is a general lims function, along with specific functions (xlim and ylim) for the individual parts. In addition, we may want to get really detailed using scale_x_* or scale_y_*.

ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point() +
  ylim(c(0, 60))

ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point() +
  scale_y_continuous(
    limits = c(0, 60),
    breaks = seq(0, 60, by = 12),
    minor_breaks = seq(6, 60, by = 6)
  )
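For completeness, here is a small sketch of the general lims function, plus a caveat worth knowing: limits set on a scale discard data outside the range, while coord_cartesian only zooms. The specific ranges are arbitrary choices for illustration.

```r
library(ggplot2)

p = ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_lims = p + lims(x = c(1, 8), y = c(0, 60))    # set both axes' limits at once
p_zoom = p + coord_cartesian(ylim = c(20, 30))  # zoom in; all data are kept
p_drop = p + ylim(20, 30)                       # points outside the range are dropped (with a warning)
```

The difference matters most when a stat is involved, e.g. a smooth fit to p_drop would only use the points that remain.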

Another common option is to change the size of points in some way. While we assign the aesthetic as before, it comes with defaults that might not work for a given situation. Play around with the range values.

ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point() +
  scale_size(range = c(1, 3))

We will talk about color issues later, but for now, you may want to apply something besides the default options. The following shows a built-in color scale for a color aesthetic that is treated as continuous, and a discrete one for which we supply our own colors (these actually come from plotly’s default color scheme).

ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point() +
  scale_color_gradient2()

ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"))

We can even change the scale of the data itself.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10()

In short, scale alterations are really useful for getting just the plot you want, and there is a lot of flexibility for you to work with. There are a lot of scales too, so know what you have available.

scale_alpha, scale_alpha_continuous, scale_alpha_date, scale_alpha_datetime, scale_alpha_discrete, scale_alpha_identity, scale_alpha_manual, scale_alpha_ordinal: Alpha transparency scales
scale_color_brewer, scale_color_distiller: Sequential, diverging and qualitative colour scales from colorbrewer.org
scale_color_continuous, scale_color_discrete, scale_color_gradient, scale_color_gradient2, scale_color_gradientn, scale_color_grey, scale_color_hue, scale_color_identity, scale_color_manual, scale_color_viridis_c, scale_color_viridis_d, scale_continuous_identity: Various color scales
scale_discrete_identity, scale_discrete_manual: Discrete scales
scale_fill_brewer, scale_fill_continuous, scale_fill_date, scale_fill_datetime, scale_fill_discrete, scale_fill_distiller, scale_fill_gradient, scale_fill_gradient2, scale_fill_gradientn, scale_fill_grey, scale_fill_hue, scale_fill_identity, scale_fill_manual, scale_fill_ordinal, scale_fill_viridis_c, scale_fill_viridis_d: Scales for geoms that can be filled with color
scale_linetype, scale_linetype_continuous, scale_linetype_discrete, scale_linetype_identity, scale_linetype_manual: Scales for line patterns
scale_shape, scale_shape_continuous, scale_shape_discrete, scale_shape_identity, scale_shape_manual, scale_shape_ordinal: Scales for shapes, aka glyphs
scale_size, scale_size_area, scale_size_continuous, scale_size_date, scale_size_datetime, scale_size_discrete, scale_size_identity, scale_size_manual, scale_size_ordinal: Scales for area or radius
scale_x_continuous, scale_x_date, scale_x_datetime, scale_x_discrete, scale_x_log10, scale_x_reverse, scale_x_sqrt, scale_y_continuous, scale_y_date, scale_y_datetime, scale_y_discrete, scale_y_log10, scale_y_reverse, scale_y_sqrt: Position scales for continuous data (x & y)
scale_x_time, scale_y_time: Position scales for date/time data

Facets

Facets allow for paneled display, a very common operation. In general, we often want comparison plots. The facet_grid function will produce a grid, and often this is all that’s needed. However, facet_wrap is more flexible, while possibly taking a bit of extra effort to get things just the way you want. Both use a formula approach to specify the grouping.

facet_grid

Facet by cylinder.

ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  facet_grid(~ cyl)

Facet by vs and cylinder.

ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  facet_grid(vs ~ cyl, labeller = label_both)
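Since facet_wrap was mentioned but not shown, here is a brief sketch with the mpg data. The choice of four columns and free y axes is only to illustrate the extra control.

```r
library(ggplot2)

p_wrap = ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, ncol = 4, scales = 'free_y')  # one panel per class, wrapped into rows

p_wrap
```

With seven vehicle classes, this wraps the panels into two rows rather than forcing a single row or column.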

Multiple plots

Often we want distinct visualizations to come together in one plot. There are several packages that can help you here: gridExtra, cowplot, and more recently patchwork⁴⁶. The latter especially makes things easy.

library(patchwork)

g1 = ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point()

g2 = ggplot(mtcars, aes(wt)) + 
  geom_density()

g3 = ggplot(mtcars, aes(mpg)) + 
  geom_density()

g1 /                       # initial plot, place next part underneath
  (g2 | g3)                # groups g2 and g3 side by side

Not that you want this, but just to demonstrate the flexibility.

p1 = ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 = ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))
p3 = ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 = ggplot(mtcars) + geom_bar(aes(carb))
p5 = ggplot(mtcars) + geom_violin(aes(cyl, mpg, group = cyl))

p1 +
  p2 +
  (p3 / p4) * theme_void() +
  p5 +
  plot_layout(widths = c(2, 1))

You’ll typically want to use facets to show subsets of the same data, and tools like patchwork to show different kinds of plots together.

Fine control

ggplot2 makes it easy to get good looking graphs quickly. However the amount of fine control is extensive. The following plot is hideous (aside from the background, which is totally rad), but illustrates the point.

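library(grid)   # provides rasterGrob()

# lambosun is an image object assumed to have been loaded beforehand
ggplot(aes(x = carat, y = price), data = diamonds) +
  annotation_custom(
    rasterGrob(
      lambosun,
      width = unit(1, "npc"),
      height = unit(1, "npc"),
      interpolate = FALSE
    ),
    xmin = -Inf,
    xmax = Inf,
    ymin = -Inf,
    ymax = Inf
  ) +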
  geom_point(aes(color = clarity), alpha = .5) +
  scale_y_log10(breaks = c(1000, 5000, 10000)) +
  xlim(0, 10) +
  scale_color_brewer(type = 'div') +
  facet_wrap( ~ cut, ncol = 3) +
  theme_minimal() +
  theme(
    axis.ticks.x = element_line(color = 'darkred'),
    axis.text.x = element_text(angle = -45),
    axis.text.y = element_text(size = 20),
    strip.text = element_text(color = 'forestgreen'),
    strip.background = element_blank(),
    panel.grid.minor = element_line(color = 'lightblue'),
    legend.key = element_rect(linetype = 4),
    legend.position = 'bottom'
  )

Themes

In the last example you saw two uses of a theme: a built-in version that comes with ggplot (theme_minimal), and specific customization (theme(…)). The built-in themes provide ready-made approaches that might already be good enough for a finished product. For the theme function, each argument, and there are many, takes on a specific value or an element function:

element_rect
element_line
element_text
element_blank

Each of those element functions has arguments specific to it. For example, for element_text you can specify the font size, while for element_line you could specify the line type.

Note that the base theme of ggplot, and I would say every plotting package, is probably going to need manipulation before a plot is ready for presentation. For example, the ggplot theme doesn’t work well for web presentation, and is even worse for print. You will almost invariably need to tweak it. I suggest using and saving your own custom theme for easy application for any visualization package you use frequently.
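A minimal sketch of what saving your own theme might look like follows. The name theme_mine and the particular settings are hypothetical; the idea is just to wrap a built-in theme plus your tweaks in a function.

```r
library(ggplot2)

# Hypothetical custom theme: start from a built-in theme, then adjust elements.
theme_mine = function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      panel.grid.minor = element_blank(),
      axis.title       = element_text(color = 'gray30'),
      plot.title       = element_text(face = 'bold'),
      legend.position  = 'bottom'
    )
}

p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_mine()

# Or apply it to every subsequent plot in the session:
# theme_set(theme_mine())
```

Keeping the function in a script or personal package means you can apply the same look everywhere with one line.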

Extensions

ggplot2 now has its own extension system, and there is even a website to track the extensions. Examples include:

additional themes
maps
interactivity
animations
marginal plots
network graphs
time series
aligning multiple ggplot visualizations, possibly of different types

Here’s an example with gganimate.

library(dplyr)      # filter() and %>%
library(gganimate)
library(visibly)    # provides theme_clean()

load('data/gapminder.RData')

gap_plot = gapminder_2019 %>% 
  filter(giniPercap != 40) 

gap_plot_filter = gap_plot %>% 
  filter(country %in% c('United States', 'Mexico', 'Canada'))

initial_plot = ggplot(gap_plot, aes(x = year, y = giniPercap, group = country)) +
  geom_line(alpha = .05) +
  geom_path(
    aes(color = country),
    lwd = 2,
    arrow = arrow(
      length = unit(0.25, "cm")
    ), 
    alpha = .5,
    data = gap_plot_filter,
    show.legend = FALSE
  ) +
  geom_text(
    aes(color = country, label = country),
    nudge_x = 5, 
    nudge_y = 2, 
    size = 2,
    data = gap_plot_filter,
    show.legend = FALSE
  ) +
  theme_clean() +
  transition_reveal(year)

animate(initial_plot, end_pause = 50, nframes = 150, rewind = TRUE)

As one can see, ggplot2 is only the beginning. You’ll have a lot of tools at your disposal. Furthermore, many modeling and other packages will produce ggplot graphics to which you can add your own layers and tweak like you would any other ggplot.
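To illustrate adding layers to a plot you didn’t build yourself, here is a sketch in which plot_fit is a hypothetical stand-in for a package function that returns a ggplot object. Real packages differ, but the result can typically be extended in the same way.

```r
library(ggplot2)

# Hypothetical stand-in for a package function that returns a ggplot object.
plot_fit = function(data) {
  ggplot(data, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = 'lm', formula = y ~ x)
}

p = plot_fit(mtcars)

# The returned object accepts further layers, labels, and themes as usual.
p_tweaked = p +
  geom_rug(sides = 'b') +
  labs(x = 'Weight (1000 lbs)', y = 'Miles per gallon') +
  theme_minimal()
```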

ggplot2 Summary

ggplot2 is an easy to use, but powerful visualization tool. It allows one to think in many dimensions for any graph, and extends well beyond the basics. Use it to easily create more interesting visualizations.

ggplot2 Exercises

Exercise 0

Load the ggplot2 package if you haven’t already.

Exercise 1

Create two plots, one a scatterplot (e.g. with geom_point) and one with lines (e.g. geom_line), with a data set of your choosing (all of the following are in base R or available after loading ggplot2). Some suggestions:

faithful: Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
msleep: mammals sleep dataset with sleep times and weights etc.
diamonds: used in the examples above
economics: US economic time series.
txhousing: Housing sales in TX.
midwest: Midwest demographics.
mpg: Fuel economy data from 1999 and 2008 for 38 popular models of car

Recall the basic form for ggplot.

ggplot(data = *, aes(x = *, y = *, other)) +
  geom_*() +
  otherLayers, theme etc.

Themes to play with:

theme_bw
theme_classic
theme_dark
theme_gray
theme_light
theme_linedraw
theme_minimal
theme_clean (requires the visibly package and an appreciation of the Lamborghini background from the previous visualization)

Exercise 2

Play around and change the arguments to the following. You’ll need to install the maps package.

For example, do points for all county midpoints. For that you’d need to change the x and y for the point geom to an aesthetic based on the longitude and latitude, as well as add its data argument to use the seats data frame.
Make the fill of the polygons based on subregion. This will require moving the fill argument of the polygon geom into the aesthetic in place of the NA setting. In addition, add the argument show.legend = FALSE (outside the aesthetic), or you’ll have a problematic legend (recall what we said before about too many colors!). Try making the color of the points or text based on subregion too.
See if you can use element_blank on a theme argument to remove the axis information. See ?theme for ideas.

library(maps)
mi = map_data("county", "michigan")
seats = mi %>% 
  group_by(subregion) %>% 
  summarise_at(vars(lat, long), function(x) median(range(x)))

# inspect the data
# head(mi)
# head(seats)

ggplot(mi, aes(long, lat)) +
  geom_polygon(aes(group = subregion), fill = NA, colour = "grey60") +
  geom_text(aes(label = subregion), data = seats, size = 1, angle = 45) +
  geom_point(x=-83.748333, y=42.281389, color='#1e90ff', size=3) +
  theme_minimal() +
  theme(panel.grid=element_blank())

Python Plotnine Notebook

The R community really lucked out with ggplot, and the basic philosophy behind it is missing from practically every other static plotting package or tool. Python’s version of base R plotting is matplotlib, which continues to serve people well. But like R base plots, it can take a lot of work to get anything remotely visually appealing. Seaborn is another option, but it still just isn’t in the same league.

If using Python though, you’re in luck! You get most of the basic functionality of ggplot2 via the plotnine module. A Jupyter notebook demonstrating most of the previous is available here.

⁴⁵ The reason the legend is hard to make out is because the transparency is mapped to the colors and size. While I personally have never come across this being a desirable situation, nor can fathom why it would be the default, it can be fixed via + guides(colour = guide_legend(override.aes = list(alpha = 1)), size = guide_legend(override.aes = list(alpha = 1))). Apparently because that’s fun to have to do every time and easy to remember.
⁴⁶ The author of patchwork is the same as that for the scico package. He also provides gganimate, lime and other useful tools.