Visualization is key to telling the data’s story, and it can take a lot of work to get things to look just right. But, it can also be a lot of fun, so let’s dive in!
When it comes to visualization, the most popular package used in R is ggplot2. It is so popular that it, or at least its aesthetic, has been copied in other languages and programs. It entails a grammar of graphics (hence the gg), and learning that grammar is key to using it effectively. Some of the strengths of ggplot2 include:
- The ease of getting a good looking plot
- Easy customization
- A lot of necessary data processing is done for you
- Clear syntax
- Easy multidimensional approach
- Decent default color scheme
- Lots of extensions
Every graph is built from the same few parts, and it’s important to be aware of a few key ideas, which we will cover in turn.
- Layers (and geoms)
Note that while you can obviously use base R for visualization, it’s not going to be as easy or as flexible as ggplot2.
In general, we start with a base layer and add to it. In most cases you’ll start as follows.
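A minimal sketch of such a starting point. It assumes the starwars data that ships with dplyr (suggested by the Jabba reference), and the choice of height and mass as the x and y variables is an illustration rather than a requirement.

```r
library(ggplot2)

# Assumption: starwars comes from the dplyr package; height and mass are
# two of its numeric columns.
base_plot <- ggplot(data = dplyr::starwars, aes(x = height, y = mass))

# Printing the object draws the (empty) plot
base_plot
```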
The code above would just produce a plot background, but nothing else. However, with the foundation in place, we’re now ready to add something to it. Let’s add some points (the outlier is Jabba the Hut).
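As a sketch, adding points could look like the following. The starwars data (from dplyr) and the height/mass variables are assumptions used for illustration; rows with missing values are dropped with a warning when the plot is drawn.

```r
library(ggplot2)

# Same assumed base as before, now with a layer of points
point_plot <- ggplot(data = dplyr::starwars, aes(x = height, y = mass)) +
  geom_point()

point_plot
```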
Perhaps we want to change labels or theme. These would be additional layers to the plot.
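One possible version is sketched here. The label text and the choice of theme_dark are illustrative assumptions, as is the starwars data from dplyr.

```r
library(ggplot2)

labeled_plot <- ggplot(data = dplyr::starwars, aes(x = height, y = mass)) +
  geom_point(color = "white") +
  # labels are one layer ...
  labs(x = "Height in cm", y = "Weight in kg") +
  # ... and the theme is another
  theme_dark()

labeled_plot
```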
Each layer is added in sequence with the + operator, which acts much like a pipe, and layers may involve geoms, scales, labels, facets, etc. You may have many different layers to produce one plot, and there really is no limit. However, some efficiencies may be possible for a given situation. For example, it’s more straightforward to use geom_smooth than to calculate fits, standard errors, etc. and then add multiple geoms to produce the same thing. This is the sort of thing you’ll get used to as you use ggplot more.
As we saw, layers are added via piping (+). The first layers added after the base are typically geoms, or geometric objects that represent the data, and include things like points, lines, densities, and text.
In case you’re wondering why ggplot doesn’t use %>% as in the tidyverse and other visualization packages, it’s because ggplot2 was using pipes before it was cool, well before those came along. Otherwise, the concept is the same as we saw in the data processing section.
Our base is provided via the ggplot function, and specifies the data at the very least, but commonly also the x and y aesthetics.
The geom_point function adds a layer of points, and now we would have a scatterplot. Alternatively, you could have specified the x and y aesthetics at the geom_point layer, but if you’re going to have the same x, y, color, etc. aesthetics regardless of layer, put them in the base. Otherwise, doing it by layer gives you more flexibility if needed. Geoms even have their own data argument, allowing you to combine information from several sources for a single visualization.
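To make the layer-level options concrete, here is a small sketch using the built-in mtcars data. The cyl_means summary table is a hypothetical second data source created only for this illustration.

```r
library(ggplot2)

# Aesthetics declared in the geom rather than in the base
layer_aes_plot <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

# Hypothetical second data source: mean weight and mpg per cylinder count
cyl_means <- aggregate(cbind(wt, mpg) ~ cyl, data = mtcars, FUN = mean)

# The second geom uses its own data argument
two_source_plot <- layer_aes_plot +
  geom_point(aes(x = wt, y = mpg), data = cyl_means, color = "red", size = 4)

two_source_plot
```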
Aesthetics map data to various visual aspects of the plot, including size, color etc. The function used in ggplot to do this is aes.
The best way to decide what goes into the aes function is to ask whether the value varies with the data. For example, if I want the size of points to be a certain value, I would code the following.
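A sketch of the fixed-value case, assuming the mtcars data and an arbitrary size of 4:

```r
library(ggplot2)

# size is a constant, so it sits outside aes()
fixed_size_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 4)

fixed_size_plot
```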
However, if I want the size to be associated with the data in some way, I use it as an aesthetic.
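A sketch of the data-driven case, assuming the mtcars data with horsepower (hp) as the variable that drives point size:

```r
library(ggplot2)

# size varies with the data, so it goes inside aes()
mapped_size_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(size = hp))

mapped_size_plot
```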
The same goes for practically any aspect of a geom: size, color, fill, etc. If it is a fixed value, set it outside the aesthetic. If it varies based on the data, put it within an aesthetic.
In the ggplot2 world, geoms are the geometric objects: shapes, lines, and other parts of the visualization we want to display. Even if you use ggplot2 a lot, you probably didn’t know about many or most of these.
- geom_abline: Reference lines: horizontal, vertical, and diagonal
- geom_area: Ribbons and area plots
- geom_bar: Bar charts
- geom_bin2d: Heatmap of 2d bin counts
- geom_blank: Draw nothing
- geom_boxplot: A box and whiskers plot (in the style of Tukey)
- geom_col: Bar charts
- geom_contour: 2d contours of a 3d surface
- geom_count: Count overlapping points
- geom_crossbar: Vertical intervals: lines, crossbars & errorbars
- geom_curve: Line segments and curves
- geom_density: Smoothed density estimates
- geom_density_2d: Contours of a 2d density estimate
- geom_dotplot: Dot plot
- geom_errorbar: Vertical intervals: lines, crossbars & errorbars
- geom_errorbarh: Horizontal error bars
- geom_freqpoly: Histograms and frequency polygons
- geom_hex: Hexagonal heatmap of 2d bin counts
- geom_histogram: Histograms and frequency polygons
- geom_hline: Reference lines: horizontal, vertical, and diagonal
- geom_jitter: Jittered points
- geom_label: Text
- geom_line: Connect observations
- geom_linerange: Vertical intervals: lines, crossbars & errorbars
- geom_map: Polygons from a reference map
- geom_path: Connect observations
- geom_point: Points
- geom_pointrange: Vertical intervals: lines, crossbars & errorbars
- geom_polygon: Polygons
- geom_qq: A quantile-quantile plot
- geom_qq_line: A quantile-quantile plot
- geom_quantile: Quantile regression
- geom_raster: Rectangles
- geom_rect: Rectangles
- geom_ribbon: Ribbons and area plots
- geom_rug: Rug plots in the margins
- geom_segment: Line segments and curves
- geom_sf: Visualise sf objects
- geom_sf_label: Visualise sf objects
- geom_sf_text: Visualise sf objects
- geom_smooth: Smoothed conditional means
- geom_spoke: Line segments parameterised by location, direction and distance
- geom_step: Connect observations
- geom_text: Text
- geom_tile: Rectangles
- geom_violin: Violin plot
- geom_vline: Reference lines: horizontal, vertical, and diagonal
Let’s get more of a feel for things by seeing some examples that demonstrate some geoms and aesthetics.
To begin, after setting the base aesthetic, we’ll set some explicit values for the geom.
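A sketch of that first step, assuming the diamonds data that comes with ggplot2. The particular size and color are arbitrary choices.

```r
library(ggplot2)

explicit_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  # fixed values for the geom, not mapped to the data
  geom_point(size = .5, color = "darkorange")

explicit_plot
```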
Next we use two different geoms, and one is even using a different data source. Note that geoms have arguments both common and specific to them. In the following, label is used for geom_text, but it would be ignored by geom_line.
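A sketch of the idea, assuming the economics data that comes with ggplot2. The peak data frame and its label column are hypothetical, built here only to give geom_text a separate data source.

```r
library(ggplot2)

# Hypothetical second data source: the single row with the highest unemployment
peak <- economics[which.max(economics$unemploy), ]
peak$label <- "peak"

two_geom_plot <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  # label is specific to geom_text; x and y are inherited from the base
  geom_text(aes(label = label), data = peak, vjust = -0.5)

two_geom_plot
```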
In the following, one setting, alpha (transparency), is not mapped to the data, while size and color are [45].
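A sketch along those lines, assuming the diamonds data. The variables mapped to size and color are illustrative choices.

```r
library(ggplot2)

mixed_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(
    aes(size = carat, color = clarity),  # mapped to the data
    alpha = .05                          # fixed value
  )

mixed_plot
```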
There are some other options to play with as well.
There are many statistical functions built in, and it is a key strength of ggplot that you don’t have to do a lot of processing for very common plots.
Here are some quantile regression lines:
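A sketch using geom_quantile with the mpg data from ggplot2. The chosen quantiles are arbitrary, and drawing the plot requires the quantreg package to be installed.

```r
library(ggplot2)

quantile_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = .3) +
  # lines for the 25th, 50th, and 75th percentiles of hwy given displ
  geom_quantile(quantiles = c(.25, .5, .75))

# quantile_plot   # uncomment to draw (needs quantreg)
```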
The following shows loess (or additive model) smooths. We can do some fine-tuning and use model-based approaches for visualization.
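A sketch of both kinds of smooth on the mpg data. The spline specification for the additive model is an assumption for illustration, and the gam method relies on the mgcv package.

```r
library(ggplot2)

smooth_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = .3) +
  # loess fit without its interval
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "grey40") +
  # additive model fit with a cubic regression spline
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), color = "darkred")

smooth_plot
```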
Bootstrapped confidence intervals:
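One way to get these is stat_summary with mean_cl_boot, sketched here on mtcars. Drawing the plot requires the Hmisc package, and the choice of a pointrange geom is an assumption.

```r
library(ggplot2)

boot_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  # mean with a bootstrapped confidence interval for each cylinder group
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange")

# boot_plot   # uncomment to draw (needs Hmisc)
```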
The take-home message here is to always let ggplot do the work for you if at all possible. However, I will say that I find it easier to create the summary data I want to visualize with tidyverse tools, rather than use stat_summary, and you may have a similar experience.
Often there are many things we want to change about the plot, for example, the size and values of axis labels, the range of sizes for points to take, the specific colors we want to use, and so forth. Be aware that there are a great many options here, and you will regularly want to use them.
A very common thing you’ll do is change the labels for the axes. You definitely don’t have to go and change the variable name itself to do this; just use the labs function. There are also functions for individual parts, e.g. xlab, ylab and ggtitle.
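A short sketch with mtcars; the label text is illustrative.

```r
library(ggplot2)

labs_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel economy by weight"
  )

labs_plot
```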
A frequent operation is changing the x and y look in the form of limits and tick marks. Like labs, there is a general lims function and specific functions for just the specific parts. In addition, we may want to get really detailed using scale_x_* or scale_y_*.
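A sketch showing both routes on mtcars: lims for a quick limit change on x, and scale_y_continuous for limits plus explicit tick marks. The specific ranges and breaks are arbitrary.

```r
library(ggplot2)

axis_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # general limits function
  lims(x = c(1, 6)) +
  # detailed control over the y axis
  scale_y_continuous(limits = c(10, 35), breaks = seq(10, 35, by = 5))

axis_plot
```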
Another common option is to change the size of points in some way. While we assign the aesthetic as before, it comes with defaults that might not work for a given situation. Play around with the range values.
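A sketch with mtcars, where the range of 1 to 8 is just a starting point to experiment with:

```r
library(ggplot2)

size_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(size = hp)) +
  # smallest and largest point sizes after the mapping
  scale_size(range = c(1, 8))

size_plot
```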
We will talk about color issues later, but for now, you may want to apply something besides the default options. The following shows a built-in color scale for a color aesthetic that is treated as continuous, and one that is discrete and which we want to supply our own colors (these actually come from plotly’s default color scheme).
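A sketch of each case on mtcars. The viridis scale stands in for the built-in continuous scale, and the three hex codes are assumed here to match the first colors of plotly's default palette.

```r
library(ggplot2)

# Continuous color aesthetic with a built-in scale
continuous_color_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = hp), size = 3) +
  scale_color_viridis_c()

# Discrete color aesthetic with manually supplied colors
discrete_color_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)), size = 3) +
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c"))

continuous_color_plot
discrete_color_plot
```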
We can even change the scale of the data itself.
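For instance, a log scale on both axes, sketched with the diamonds data (an assumed choice):

```r
library(ggplot2)

log_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = .1) +
  # axes are transformed; the data frame itself is untouched
  scale_x_log10() +
  scale_y_log10()

log_plot
```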
In short, scale alterations are really useful for getting just the plot you want, and there is a lot of flexibility for you to work with. There are a lot of scales too, so know what you have available.
- scale_alpha, scale_alpha_continuous, scale_alpha_date, scale_alpha_datetime, scale_alpha_discrete, scale_alpha_identity, scale_alpha_manual, scale_alpha_ordinal: Alpha transparency scales
- scale_color_brewer, scale_color_distiller: Sequential, diverging and qualitative colour scales from colorbrewer.org
- scale_color_continuous, scale_color_discrete, scale_color_gradient, scale_color_gradient2, scale_color_gradientn, scale_color_grey, scale_color_hue, scale_color_identity, scale_color_manual, scale_color_viridis_c, scale_color_viridis_d, scale_continuous_identity: Various color scales
- scale_discrete_identity, scale_discrete_manual: Discrete scales
- scale_fill_brewer, scale_fill_continuous, scale_fill_date, scale_fill_datetime, scale_fill_discrete, scale_fill_distiller, scale_fill_gradient, scale_fill_gradient2, scale_fill_gradientn, scale_fill_grey, scale_fill_hue, scale_fill_identity, scale_fill_manual, scale_fill_ordinal, scale_fill_viridis_c, scale_fill_viridis_d: Scales for geoms that can be filled with color
- scale_linetype, scale_linetype_continuous, scale_linetype_discrete, scale_linetype_identity, scale_linetype_manual: Scales for line patterns
- scale_shape, scale_shape_continuous, scale_shape_discrete, scale_shape_identity, scale_shape_manual, scale_shape_ordinal: Scales for shapes, aka glyphs
- scale_size, scale_size_area, scale_size_continuous, scale_size_date, scale_size_datetime, scale_size_discrete, scale_size_identity, scale_size_manual, scale_size_ordinal: Scales for area or radius
- scale_x_continuous, scale_x_date, scale_x_datetime, scale_x_discrete, scale_x_log10, scale_x_reverse, scale_x_sqrt, scale_y_continuous, scale_y_date, scale_y_datetime, scale_y_discrete, scale_y_log10, scale_y_reverse, scale_y_sqrt: Position scales for continuous data (x & y)
- scale_x_time, scale_y_time: Position scales for date/time data
Facets allow for paneled display, a very common operation. In general, we often want comparison plots. The facet_grid function will produce a grid, and often this is all that’s needed. However, facet_wrap is more flexible, while possibly taking a bit of extra effort to get things just the way you want. Both use a formula approach to specify the grouping.
Facet by cylinder.
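A sketch with mtcars, using the formula interface to put one panel per cylinder count in columns. The x and y variables are illustrative.

```r
library(ggplot2)

facet_cyl_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(~ cyl)

facet_cyl_plot
```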
Facet by vs and cylinder.
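A sketch with mtcars, with vs defining the rows and cyl the columns. The label_both labeller is an optional addition that prints the variable name alongside its value in each strip.

```r
library(ggplot2)

facet_vs_cyl_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(vs ~ cyl, labeller = label_both)

facet_vs_cyl_plot
```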
Often we want distinct visualizations to come together in one plot. There are several packages that can help you here: gridExtra, cowplot, and more recently patchwork [46]. The latter especially makes things easy.
Not that you want this, but just to demonstrate the flexibility.
```r
library(ggplot2)
library(patchwork)  # provides the +, / and * operators for combining plots

p1 = ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 = ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))
p3 = ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 = ggplot(mtcars) + geom_bar(aes(carb))
p5 = ggplot(mtcars) + geom_violin(aes(cyl, mpg, group = cyl))

# p3 is stacked over p4, and theme_void is applied to both via *
p1 + p2 + (p3 / p4) * theme_void() + p5 + plot_layout(widths = c(2, 1))
```
You’ll typically want to use facets to show subsets of the same data, and tools like patchwork to show different kinds of plots together.
ggplot2 makes it easy to get good looking graphs quickly. However, the amount of fine control is extensive. The following plot is hideous (aside from the background, which is totally rad), but illustrates the point.

```r
library(ggplot2)
library(grid)  # rasterGrob() and unit()

# lambosun is an image object loaded elsewhere (not shown here)
ggplot(aes(x = carat, y = price), data = diamonds) +
  annotation_custom(
    rasterGrob(
      lambosun,
      width = unit(1, "npc"),
      height = unit(1, "npc"),
      interpolate = FALSE
    ),
    -Inf, Inf, -Inf, Inf   # xmin, xmax, ymin, ymax: fill the whole panel
  ) +
  geom_point(aes(color = clarity), alpha = .5) +
  scale_y_log10(breaks = c(1000, 5000, 10000)) +
  xlim(0, 10) +
  scale_color_brewer(type = 'div') +
  facet_wrap( ~ cut, ncol = 3) +
  theme_minimal() +
  theme(
    axis.ticks.x = element_line(color = 'darkred'),
    axis.text.x = element_text(angle = -45),
    axis.text.y = element_text(size = 20),
    strip.text = element_text(color = 'forestgreen'),
    strip.background = element_blank(),
    panel.grid.minor = element_line(color = 'lightblue'),
    legend.key = element_rect(linetype = 4),
    legend.position = 'bottom'
  )
```
In the last example you saw two uses of a theme: a built-in version that comes with ggplot (theme_minimal), and specific customization (theme(…)). The built-in themes provide ready-made approaches that might already be good enough for a finished product. For the theme function, each argument, and there are many, takes on a specific value or an element function (element_blank, element_line, element_rect, or element_text).
Each of those element functions has arguments specific to it. For example, for element_text you can specify the font size, while for element_line you could specify the line type.
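A small sketch of theme arguments taking either a plain value or an element function. The mtcars data and the particular settings are illustrative only.

```r
library(ggplot2)

themed_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal() +
  theme(
    axis.title = element_text(size = 14),                   # font size
    panel.grid.major = element_line(linetype = "dashed"),   # line type
    panel.grid.minor = element_blank(),                     # remove entirely
    legend.position = "bottom"                              # a plain value
  )

themed_plot
```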
Note that the base theme of ggplot, and I would say every plotting package, is probably going to need manipulation before a plot is ready for presentation. For example, the ggplot theme doesn’t work well for web presentation, and is even worse for print. You will almost invariably need to tweak it. I suggest using and saving your own custom theme for easy application for any visualization package you use frequently.
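One way to keep a reusable theme is to wrap the settings in a function. This is a sketch; theme_mine is a hypothetical name, and the particular settings are placeholders for whatever suits your output.

```r
library(ggplot2)

# Hypothetical custom theme built on top of theme_minimal
theme_mine <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      panel.grid.minor = element_blank(),
      axis.title = element_text(face = "bold"),
      legend.position = "bottom"
    )
}

custom_theme_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_mine()

custom_theme_plot
```

Saving the function in a script or personal package lets you apply the same look to every plot with a single line.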
ggplot2 now has its own extension system, and there is even a website to track the extensions. Examples include:
- additional themes
- marginal plots
- network graphs
- time series
- aligning multiple ggplot visualizations, possibly of different types
Here’s an example with gganimate.
```r
library(dplyr)      # %>% and filter()
library(ggplot2)
library(gganimate)
library(visibly)    # provides theme_clean()

load('data/gapminder.RData')

gap_plot = gapminder_2019 %>%
  filter(giniPercap != 40)

gap_plot_filter = gap_plot %>%
  filter(country %in% c('United States', 'Mexico', 'Canada'))

initial_plot = ggplot(gap_plot, aes(x = year, y = giniPercap, group = country)) +
  geom_line(alpha = .05) +
  geom_path(
    aes(color = country),
    lwd = 2,
    arrow = arrow(length = unit(0.25, "cm")),
    alpha = .5,
    data = gap_plot_filter,
    show.legend = FALSE
  ) +
  geom_text(
    aes(color = country, label = country),
    nudge_x = 5,
    nudge_y = 2,
    size = 2,
    data = gap_plot_filter,
    show.legend = FALSE
  ) +
  theme_clean() +
  transition_reveal(year)

animate(initial_plot, end_pause = 50, nframes = 150, rewind = TRUE)
```
As one can see, ggplot2 is only the beginning. You’ll have a lot of tools at your disposal. Furthermore, many modeling and other packages will produce ggplot graphics to which you can add your own layers and tweak like you would any other ggplot.
ggplot2 is an easy-to-use but powerful visualization tool. It allows one to think in many dimensions for any graph, and extends well beyond the basics. Use it to easily create more interesting visualizations.
Load the ggplot2 package if you haven’t already.
Create two plots, one a scatterplot (e.g. with geom_point) and one with lines (e.g. geom_line) with a data set of your choosing (all of the following are base R or available after loading ggplot2). Some suggestions:
- faithful: Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
- msleep: mammals sleep dataset with sleep times and weights etc.
- diamonds: used in the slides
- economics: US economic time series.
- txhousing: Housing sales in TX.
- midwest: Midwest demographics.
- mpg: Fuel economy data from 1999 and 2008 for 38 popular models of car
Recall the basic form for ggplot.
ggplot(data = *, aes(x = *, y = *, other)) + geom_*() + otherLayers, theme etc.
Themes to play with:
- theme_clean (requires the visibly package and an appreciation of the Lamborghini background from the previous visualization)
Play around and change the arguments to the following. You’ll need to install the maps package.
- For example, do points for all county midpoints. For that you’d need to change the x and y for the point geom to an aesthetic based on the longitude and latitude, as well as add its data argument to use the seats data frame.
- Make the color of the points or text based on subregion. This will require adding the fill argument to the polygon geom and removing the NA setting. In addition, add the argument show.legend=F (outside the aesthetic), or you’ll have a problematic legend (recall what we said before about too many colors!). Try making color based on subregion too.
- See if you can use element_blank on a theme argument to remove the axis information. See ?theme for ideas.
```r
library(dplyr)    # %>%, group_by() and summarise_at()
library(ggplot2)
library(maps)

mi = map_data("county", "michigan")

# approximate midpoint of each county
seats = mi %>%
  group_by(subregion) %>%
  summarise_at(vars(lat, long), function(x) median(range(x)))

# inspect the data
# head(mi)
# head(seats)

ggplot(mi, aes(long, lat)) +
  geom_polygon(aes(group = subregion), fill = NA, colour = "grey60") +
  geom_text(aes(label = subregion), data = seats, size = 1, angle = 45) +
  geom_point(x = -83.748333, y = 42.281389, color = '#1e90ff', size = 3) +
  theme_minimal() +
  theme(panel.grid = element_blank())
```
Python Plotnine Notebook
The R community really lucked out with ggplot, and the basic philosophy behind it is missing from practically every other static plotting package or tool. Python’s version of base R plotting is matplotlib, which continues to serve people well. But like R base plots, it can take a lot of work to get anything remotely visually appealing. Seaborn is another option, but it still just isn’t in the same league.
If using Python though, you’re in luck! You get most of the basic functionality of ggplot2 via the plotnine module. A jupyter notebook demonstrating most of the previous is available here.
[45] The reason the legend is hard to make out is that the transparency is mapped to the colors and size. While I personally have never come across this being a desirable situation, nor can fathom why it would be the default, it can be fixed via + guides(colour = guide_legend(override.aes = list(alpha = 1)), size = guide_legend(override.aes = list(alpha = 1))). Apparently because that’s fun to have to do every time and easy to remember.
[46] The author of patchwork is the same as that for the scico package. He also provides gganimate, lime and other useful tools.