Tidyverse

What is the Tidyverse?

The tidyverse consists of a few key packages:

  • ggplot2: data visualization
  • tibble: tibbles, a modern re-imagining of data frames
  • tidyr: data tidying
  • readr: data import
  • purrr: functional programming, e.g. alternate approaches to apply
  • dplyr: data manipulation

And of course the tidyverse package itself, which will load all of the above in a way that will avoid naming conflicts.

library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -------------------------
filter(): dplyr, stats
lag():    dplyr, stats

In addition, there are other packages like lubridate, rvest, stringr and others in the hadleyverse that are also greatly useful.

What is Tidy?

Tidy data refers to data arranged in a way that makes data processing, analysis, and visualization simpler.

In a tidy data set:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Think long before wide.

dplyr

dplyr provides a grammar of data manipulation (like ggplot2 does for visualization). It is the next iteration of plyr, but plyr is deprecated and no longer used. It’s focused on tools for working with data frames, with over 100 functions that might be of specific use to you.

It has three main goals:

  • Make the most important data manipulation tasks easier.
  • Do them faster.
  • Use the same interface to work with data frames, data tables or a database.

Some key operations include:

  • select: grab columns
    • select helpers: one_of, starts_with, num_range etc.
  • filter/slice: grab rows
  • group_by: grouped operations
  • mutate/transmute: create new variables
  • summarize: summarize/aggregate

There are various (SQL-like) join/merge functions:

  • inner_join, left_join etc.

And there are a lot of little things like:

  • n, n_distinct, nth, n_groups, count, recode, between

In addition, there is no need to quote variable names.

An example

Let’s say we want to select from our data the following variables:

  • Start with the ID variable
  • The variables X1 through X10, which are not all grouped together, and there are many more X* columns
  • The variables var1 and var2, which are the only variables with var in their name
  • Any variable with a name that starts with XYZ

How might we go about this in a dataset of possibly hundreds or even thousands of columns? There are several base R approaches that we could go with, but often they will be tedious, or require multiple objects to be created just to get the columns you want. Let’s start with the worst choice.

newData = oldData[,c(1,2,3,4, etc.)]

Using numeric indexes, or rather magic numbers, is not conducive to readability or reproducibility. If anything changes about the data columns, the numbers may no longer be applicable, and you’d have to redo the line again.

We could name the variables explicitly.

newData = oldData[,c('ID','X1', 'X2', etc.)]

This would be fine if there are only a handful. But if you’re trying to reduce a 1000 column data set to several dozen it’s tedious, and generally not pretty regardless.

A more advanced alternative regards a two-step approach with regular expressions. This requires that you know something about regex (and you should), but it is difficult to read/understand by those who don’t, and often by even yourself if it’s more complicated. In any case, you first will need to create an object that represents the column names first, otherwise it looks unwieldy if used within brackets or a function like subset.

cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep(colnames(oldData), '^XYZ', value=T))

newData = oldData[,cols]

# or via subset
newData = subset(oldData, select = cols)

Now consider there is even more to do. What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)? Probably the more straightforward way in R to do so would be something like the following, where each part is broken out and we continuously write over the object as we modify it.

# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]

And this is for fairly straightforward operations.

Now consider doing all of the previous in one piped operation. The dplyr package will allow us to do something like the following.

newData = oldData %>% 
  select(num_range('X', 1:10), 
         contains('var'), 
         starts_with('XYZ')) %>% 
  filter(Z == 'Yes', 
         Q == 'No') %>% 
  top_n(n=50, var2) %>% 
  arrange(desc(var1))

Even if it hadn’t been explained before, you might have been able to guess a little as to what was going on. The code is fairly succinct, we don’t have to keep referencing objects repeatedly, and no explicit intermediary objects are created.

dplyr and piping is an alternative. You can do all this sort of stuff with base R, for example, with functions like with, within, subset, transform, etc. Though the initial base R approach depicted is fairly concise, in general, it can potentially be:

  • more verbose
  • less legible
  • less amenable to additional data changes
  • requires esoteric knowledge (e.g. regular expressions)
  • often requires creation of new objects (even if we just want to explore)
  • often slower, possibly greatly

Running Example

The following data was scraped initially scraped from the web as follows. It is data from the NBA basketball league for the last season with things like player names, position, team name, points per game, field goal percentage, and various other statistics. We’ll use it as an example to demonstrate various functionality found within dplyr.

library(rvest)

current_year = lubridate::year(Sys.Date())

url = glue::glue("http://www.basketball-reference.com/leagues/NBA_{current_year-1}_totals.html")

bball = read_html(url) %>% 
  html_nodes("#totals_stats") %>% 
  html_table() %>% 
  data.frame() 

save(bball, file='data/bball.RData')

However you can just load it into your workspace as below. Note that when initially gathered from the website, the data is all character strings. We’ll fix this later. The following shows the data as it will eventually be.

load('data/bball.RData')
glimpse(bball[,1:5])
Rows: 734
Columns: 5
$ Rk     <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "16", "16", "17", "18", "19", "20", "Rk", "21", "22", "23", "23", "23", "24", "25", "26", "27", "28", "28", "28", "…
$ Player <chr> "Álex Abrines", "Quincy Acy", "Jaylen Adams", "Steven Adams", "Bam Adebayo", "Deng Adel", "DeVaughn Akoon-Purcell", "LaMarcus Aldridge", "Rawle Alkins", "Grayson Allen", "Jarrett Allen", "Kadeem Allen",…
$ Pos    <chr> "SG", "PF", "PG", "C", "C", "SF", "SG", "C", "SG", "SG", "C", "SG", "PF", "SF", "SF", "PF", "PF", "PF", "C", "PF", "PF", "PF", "Pos", "SF", "PG", "SF", "SF", "SF", "PG", "C", "SG", "PF", "SG", "SG", "SG…
$ Age    <chr> "25", "28", "22", "25", "21", "21", "25", "33", "21", "23", "20", "26", "28", "25", "25", "30", "30", "30", "20", "24", "21", "34", "Age", "21", "24", "33", "33", "33", "31", "20", "23", "19", "25", "25…
$ Tm     <chr> "OKC", "PHO", "ATL", "OKC", "MIA", "CLE", "DEN", "SAS", "CHI", "UTA", "BRK", "NYK", "POR", "ATL", "MEM", "TOT", "PHO", "MIA", "IND", "MIL", "DAL", "HOU", "Tm", "TOR", "CHI", "TOT", "PHO", "WAS", "ORL", …

Selecting Columns

Often you do not need the entire data set. While this is easily handled in base R (as shown earlier), it can be more clear to use select in dplyr. Now we won’t have to create separate objects, use quotes or $, etc.

bball %>% 
  select(Player, Tm, Pos) %>% 
  head()
        Player  Tm Pos
1 Álex Abrines OKC  SG
2   Quincy Acy PHO  PF
3 Jaylen Adams ATL  PG
4 Steven Adams OKC   C
5  Bam Adebayo MIA   C
6    Deng Adel CLE  SF

What if we want to drop some variables?

bball %>%     
  select(-Player, -Tm, -Pos)  %>% 
  head()
  Rk Age  G GS   MP  FG FGA  FG. X3P X3PA X3P. X2P X2PA X2P. eFG.  FT FTA   FT. ORB DRB TRB AST STL BLK TOV  PF  PTS
1  1  25 31  2  588  56 157 .357  41  127 .323  15   30 .500 .487  12  13  .923   5  43  48  20  17   6  14  53  165
2  2  28 10  0  123   4  18 .222   2   15 .133   2    3 .667 .278   7  10  .700   3  22  25   8   1   4   4  24   17
3  3  22 34  1  428  38 110 .345  25   74 .338  13   36 .361 .459   7   9  .778  11  49  60  65  14   5  28  45  108
4  4  25 80 80 2669 481 809 .595   0    2 .000 481  807 .596 .595 146 292  .500 391 369 760 124 117  76 135 204 1108
5  5  21 82 28 1913 280 486 .576   3   15 .200 277  471 .588 .579 166 226  .735 165 432 597 184  71  65 121 203  729
6  6  21 19  3  194  11  36 .306   6   23 .261   5   13 .385 .389   4   4 1.000   3  16  19   5   1   4   6  13   32

Helper functions

Sometimes, we have a lot of variables to select, and if they have a common naming scheme, this can be very easy.

bball %>% 
  select(Player, contains("3P"), ends_with("RB")) %>% 
  arrange(desc(TRB)) %>% 
  head()
  Player X3P X3PA X3P. ORB DRB TRB
1 Player  3P  3PA  3P% ORB DRB TRB
2 Player  3P  3PA  3P% ORB DRB TRB
3 Player  3P  3PA  3P% ORB DRB TRB
4 Player  3P  3PA  3P% ORB DRB TRB
5 Player  3P  3PA  3P% ORB DRB TRB
6 Player  3P  3PA  3P% ORB DRB TRB

The select also has helper functions to make selecting columns even easier. I probably don’t even need to explain what’s being done above, and this is the power of the tidyverse way. Here is the list of helper functions to be aware of:

  • starts_with: starts with a prefix
  • ends_with: ends with a suffix
  • contains: contains a literal string
  • matches: matches a regular expression
  • num_range: a numerical range like x01, x02, x03.
  • one_of: variables in character vector.
  • everything: all variables.

Filtering Rows

There are repeated header rows in this data3, so we need to drop them. This is also why everything was character string when we first scraped it, because having any character strings in a column coerces the entire column to be character, since all elements of a vector need to be of the same type. Character string is chosen over others because anything can be converted to a string, but not everything can be a number.

Filtering by rows requires the basic indexing knowledge we talked about before, especially Boolean indexing. In the following, Rk, or rank, is for all intents and purposes just a row id, but if it equals the actual text ‘Rk’ instead of something else, we know we’re dealing with a header row, so we’ll drop it.

bball = bball %>% 
  filter(Rk != "Rk")
  • filter returns rows with matching conditions.
  • slice allows for a numeric indexing approach4.

Say we want to look at forwards (SF or PF) over the age of 35. The following will do this, and since some players play on multiple teams, we’ll want only the unique information on the variables of interest. The function distinct allows us to do this.

bball %>% 
  filter(Age > 35, Pos == "SF" | Pos == "PF") %>% 
  distinct(Player, Pos, Age)              
         Player Pos Age
1  Vince Carter  PF  42
2   Kyle Korver  PF  37
3 Dirk Nowitzki  PF  40

Maybe we want just the first 10 rows. This is often the case when we perform some operation and need to quickly verify that what we’re doing is working in principle.

bball %>% 
  slice(1:10)
   Rk                 Player Pos Age  Tm  G GS   MP  FG  FGA  FG. X3P X3PA X3P. X2P X2PA X2P. eFG.  FT FTA   FT. ORB DRB TRB AST STL BLK TOV  PF  PTS
1   1           Álex Abrines  SG  25 OKC 31  2  588  56  157 .357  41  127 .323  15   30 .500 .487  12  13  .923   5  43  48  20  17   6  14  53  165
2   2             Quincy Acy  PF  28 PHO 10  0  123   4   18 .222   2   15 .133   2    3 .667 .278   7  10  .700   3  22  25   8   1   4   4  24   17
3   3           Jaylen Adams  PG  22 ATL 34  1  428  38  110 .345  25   74 .338  13   36 .361 .459   7   9  .778  11  49  60  65  14   5  28  45  108
4   4           Steven Adams   C  25 OKC 80 80 2669 481  809 .595   0    2 .000 481  807 .596 .595 146 292  .500 391 369 760 124 117  76 135 204 1108
5   5            Bam Adebayo   C  21 MIA 82 28 1913 280  486 .576   3   15 .200 277  471 .588 .579 166 226  .735 165 432 597 184  71  65 121 203  729
6   6              Deng Adel  SF  21 CLE 19  3  194  11   36 .306   6   23 .261   5   13 .385 .389   4   4 1.000   3  16  19   5   1   4   6  13   32
7   7 DeVaughn Akoon-Purcell  SG  25 DEN  7  0   22   3   10 .300   0    4 .000   3    6 .500 .300   1   2  .500   1   3   4   6   2   0   2   4    7
8   8      LaMarcus Aldridge   C  33 SAS 81 81 2687 684 1319 .519  10   42 .238 674 1277 .528 .522 349 412  .847 251 493 744 194  43 107 144 179 1727
9   9           Rawle Alkins  SG  21 CHI 10  1  120  13   39 .333   3   12 .250  10   27 .370 .372   8  12  .667  11  15  26  13   1   0   8   7   37
10 10          Grayson Allen  SG  23 UTA 38  2  416  67  178 .376  32   99 .323  35   79 .443 .466  45  60  .750   3  20  23  25   6   6  33  47  211

We can use filtering even with variables just created.

bball %>% 
  unite("posTeam", Pos, Tm) %>%         # create a new variable
  filter(posTeam == "SG_GSW") %>%       # use it for filtering
  select(Player, posTeam, Age) %>%      # use it for selection
  arrange(desc(Age))                    # descending order 
         Player posTeam Age
1 Klay Thompson  SG_GSW  28
2    Damion Lee  SG_GSW  26
3   Jacob Evans  SG_GSW  21

Being able to use a newly created variable on the fly, possibly only to filter or create some other variable, goes a long way toward easy visualization and generation of desired summary statistics.

Generating New Data

One of the most common data processing tasks is generating new variables. The function mutate takes a vector and returns one of the same dimension. In addition, there is mutate_at, mutate_if, and mutate_all to help with specific scenarios.

To demonstrate, we’ll use mutate_at to make appropriate columns numeric, i.e. everything except Player, Pos, and Tm. It takes two inputs, variables and functions to apply. As there are multiple variables and (potentially) multiple functions, we use the vars and funs functions to denote them5.

bball = bball %>% 
  mutate(across(c(-Player, -Pos, -Tm), as.numeric))

glimpse(bball[,1:7])
Rows: 708
Columns: 7
$ Rk     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 16, 16, 17, 18, 19, 20, 21, 22, 23, 23, 23, 24, 25, 26, 27, 28, 28, 28, 29, 30, 31, 32, 33, 33, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,…
$ Player <chr> "Álex Abrines", "Quincy Acy", "Jaylen Adams", "Steven Adams", "Bam Adebayo", "Deng Adel", "DeVaughn Akoon-Purcell", "LaMarcus Aldridge", "Rawle Alkins", "Grayson Allen", "Jarrett Allen", "Kadeem Allen",…
$ Pos    <chr> "SG", "PF", "PG", "C", "C", "SF", "SG", "C", "SG", "SG", "C", "SG", "PF", "SF", "SF", "PF", "PF", "PF", "C", "PF", "PF", "PF", "SF", "PG", "SF", "SF", "SF", "PG", "C", "SG", "PF", "SG", "SG", "SG", "PG"…
$ Age    <dbl> 25, 28, 22, 25, 21, 21, 25, 33, 21, 23, 20, 26, 28, 25, 25, 30, 30, 30, 20, 24, 21, 34, 21, 24, 33, 33, 33, 31, 20, 23, 19, 25, 25, 25, 22, 21, 20, 34, 26, 26, 26, 28, 23, 30, 30, 32, 29, 25, 22, 30, 32…
$ Tm     <chr> "OKC", "PHO", "ATL", "OKC", "MIA", "CLE", "DEN", "SAS", "CHI", "UTA", "BRK", "NYK", "POR", "ATL", "MEM", "TOT", "PHO", "MIA", "IND", "MIL", "DAL", "HOU", "TOR", "CHI", "TOT", "PHO", "WAS", "ORL", "PHO",…
$ G      <dbl> 31, 10, 34, 80, 82, 19, 7, 81, 10, 38, 80, 19, 81, 48, 43, 25, 15, 10, 3, 72, 2, 10, 67, 81, 69, 26, 43, 81, 71, 43, 62, 15, 11, 4, 16, 47, 47, 38, 77, 49, 28, 43, 30, 75, 34, 51, 67, 82, 81, 26, 79, 68…
$ GS     <dbl> 2, 0, 1, 80, 28, 3, 0, 81, 1, 2, 80, 1, 81, 4, 40, 8, 8, 0, 0, 72, 0, 2, 6, 32, 69, 26, 43, 81, 70, 13, 4, 0, 0, 0, 0, 45, 1, 0, 77, 49, 28, 38, 3, 72, 6, 18, 35, 82, 18, 2, 1, 3, 15, 27, 0, 12, 49, 1, …

Now that the data columns are of the correct type, the following demonstrates how we can use the standard mutate function to create composites of existing variables.

bball = bball %>% 
  mutate(
    trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
    effectiveFG  = (FG + (.5 * X3P)) / FGA,
    shootingDif  = trueShooting - FG.
  )

summary(select(bball, shootingDif))  # select and others don't have to be piped to use
  shootingDif      
 Min.   :-0.08561  
 1st Qu.: 0.06722  
 Median : 0.09829  
 Mean   : 0.09420  
 3rd Qu.: 0.12379  
 Max.   : 0.53192  
 NA's   :6         

Grouping and Summarizing Data

Another very common task is to look at group-based statistics, and we can use group_by and summarize to help us in this regard6. Base R has things like aggregate, by, and tapply for this, but they should not be used, as this approach is much more straightforward, flexible, and faster.

Conceptually we are doing a three-phase task: split, apply, combine. We split the data into subsets, apply a function, and then combine the results back into a single output. In applying a function, we may do any of the previously demonstrated tasks: calculate some statistic, generate new data, or even filter to a reduced part of the data.

For this demonstration, I’m going to start putting together several things we’ve demonstrated thus far. Ultimately we’ll create a variable called trueShooting, which represents ‘true shooting percentage’, and get an average for each position, and compare it to the average field goal percentage.

bball %>%   
  select(Pos, FG, FGA, FG., FTA, X3P, PTS) %>% 
  mutate(
    trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
    effectiveFG = (FG + (.5 * X3P)) / FGA,
    shootingDif = trueShooting - FG.
  ) %>%  
  group_by(Pos) %>%                                                 
  summarize(
    `Mean FG%` = mean(FG., na.rm = TRUE),
    `Mean True Shooting` = mean(trueShooting, na.rm = TRUE)
  )    
# A tibble: 11 x 3
   Pos   `Mean FG%` `Mean True Shooting`
   <chr>      <dbl>                <dbl>
 1 C          0.522                0.572
 2 C-PF       0.407                0.530
 3 PF         0.442                0.536
 4 PF-C       0.356                0.492
 5 PF-SF      0.419                0.544
 6 PG         0.409                0.512
 7 SF         0.425                0.529
 8 SF-SG      0.431                0.558
 9 SG         0.407                0.517
10 SG-PF      0.416                0.582
11 SG-SF      0.38                 0.466

We can do even more with grouped data. Specifically, we can create a new list-column in the data, the elements of which can be anything, even the results of an analysis for each group. As such, we can use tidyr’s unnest to get back to a standard data frame.

To demonstrate, the following will group data by position, then get the correlation between field-goal percentage and free-throw shooting percentage. Some players are listed with multiple positions, so we will reduce those to whatever their first position is using case_when.

bball %>% 
  mutate(
    Pos = case_when(
      Pos == 'PG-SG' ~ 'PG',
      Pos == 'C-PF'  ~ 'C',
      Pos == 'SF-SG' ~ 'SF',
      Pos == 'PF-C'  | Pos == 'PF-SF' ~ 'PF',
      Pos == 'SG-PF' | Pos == 'SG-SF' ~ 'SG',
      TRUE ~ Pos
    )) %>% 
  nest_by(Pos) %>%     
  mutate(FgFt_Corr = list(cor(data$FG., data$FT., use = 'complete'))) %>% 
  unnest(c(Pos, FgFt_Corr))
# A tibble: 5 x 3
# Groups:   Pos [5]
  Pos                  data FgFt_Corr
  <chr> <list<tbl_df[,32]>>     <dbl>
1 C              [121 × 32]  -0.122  
2 PF             [150 × 32]  -0.0186 
3 PG             [139 × 32]   0.0857 
4 SF             [120 × 32]   0.00422
5 SG             [178 × 32]  -0.0585 

As a reminder, data frames are lists. As such, anything can go into the ‘columns’, even regression models!

library(nycflights13)

carriers = group_by(flights, carrier)

group_size(carriers)  # if you're curious, there is a function to quickly get group Ns
 [1] 18460 32729   714 54635 48110 54173   685  3260   342 26397    32 58665 20536  5162 12275   601
mods = flights %>% 
  nest_by(carrier) %>% 
  mutate(model = list(lm(arr_delay ~ dep_time, data = data)) )

mods
# A tibble: 16 x 3
# Rowwise:  carrier
   carrier                data model 
   <chr>   <list<tbl_df[,18]>> <list>
 1 9E            [18,460 × 18] <lm>  
 2 AA            [32,729 × 18] <lm>  
 3 AS               [714 × 18] <lm>  
 4 B6            [54,635 × 18] <lm>  
 5 DL            [48,110 × 18] <lm>  
 6 EV            [54,173 × 18] <lm>  
 7 F9               [685 × 18] <lm>  
 8 FL             [3,260 × 18] <lm>  
 9 HA               [342 × 18] <lm>  
10 MQ            [26,397 × 18] <lm>  
11 OO                [32 × 18] <lm>  
12 UA            [58,665 × 18] <lm>  
13 US            [20,536 × 18] <lm>  
14 VX             [5,162 × 18] <lm>  
15 WN            [12,275 × 18] <lm>  
16 YV               [601 × 18] <lm>  
mods %>%
  summarize(
    carrier = carrier,
    `Adjusted Rsq` = summary(model)$adj.r.squared,
    coef_dep_time = coef(model)[2]
  )
# A tibble: 16 x 3
# Groups:   carrier [16]
   carrier `Adjusted Rsq` coef_dep_time
   <chr>            <dbl>         <dbl>
 1 9E             0.0513         0.0252
 2 AA             0.0504         0.0209
 3 AS             0.0815         0.0186
 4 B6             0.0241         0.0120
 5 DL             0.0347         0.0179
 6 EV             0.0836         0.0290
 7 F9             0.0998         0.0484
 8 FL             0.0261         0.0183
 9 HA            -0.00124       -0.0578
10 MQ             0.0499         0.0218
11 OO            -0.0189         0.0394
12 UA             0.0673         0.0220
13 US             0.0575         0.0174
14 VX             0.111          0.0362
15 WN             0.119          0.0345
16 YV             0.137          0.0805

You can use group_by on more than one variable, e.g. group_by(var1, var2)

Renaming Columns

Tibbles in the tidyverse don’t really have a problem with variable names starting with numbers or incorporating symbols and spaces. I would still suggest it is poor practice, because even if your data set looks fine, you’ll possibly encounter problems with modeling and visualization packages using that data. However, as a demonstration, we can ‘fix’ some of the variable names.

One issue is that when we scraped the data and converted it to a data.frame, the names that started with a number, like 3P for ‘three point baskets made’, were made into X3P, because that’s the way R works by default. In addition, 3P%, i.e. three point percentage made, was made into 3P. with a dot for the percent sign. Same goes for the 2P (two-pointers) and FT (free-throw) variables.

We can use rename to change column names. A basic example is as follows.

data %>% 
  rename(new_name = old_name,
         new_name2 = old_name2)

Very straightforward. However, oftentimes we’ll need to change patterns, as with our current problem. The following uses str_replace and str_remove from stringr to look for a pattern in a name, and replace that pattern with some other pattern. It uses regular expressions for the patterns.

bball = bball %>%
  rename_with(
    str_replace,      # function
    contains('.'),    # columns
    pattern = '\\.',  # function arguments
    replacement = '%'
  ) %>% 
  rename_with(str_remove, starts_with('X'), pattern = 'X')

colnames(bball)
 [1] "Rk"           "Player"       "Pos"          "Age"          "Tm"           "G"            "GS"           "MP"           "FG"           "FGA"          "FG%"          "3P"           "3PA"          "3P%"         
[15] "2P"           "2PA"          "2P%"          "eFG%"         "FT"           "FTA"          "FT%"          "ORB"          "DRB"          "TRB"          "AST"          "STL"          "BLK"          "TOV"         
[29] "PF"           "PTS"          "trueShooting" "effectiveFG"  "shootingDif" 

Merging Data

Merging data is yet another very common data task, as data often comes from multiple sources. In order to do this, we need some common identifier among the sources by which to join them. The following is a list of dplyr join functions.

inner_join: return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

left_join: return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join: return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

semi_join: return all rows from x where there are matching values in y, keeping just columns from x. It differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

anti_join: return all rows from x where there are not matching values in y, keeping just columns from x.

full_join: return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

Probably the most common is a left join, where we have one primary data set, and are adding data from another source to it while retaining it as a base. The following is a simple demonstration.

band_members
# A tibble: 3 x 2
  Name    Band        
  <chr>   <chr>       
1 Seth    Com Truise  
2 Francis Pixies      
3 Bubba   The New Year
band_instruments
# A tibble: 3 x 2
  Name    Instrument 
  <chr>   <chr>      
1 Francis Guitar     
2 Bubba   Guitar     
3 Seth    Synthesizer
left_join(band_members, band_instruments)
Joining, by = "Name"
# A tibble: 3 x 3
  Name    Band         Instrument 
  <chr>   <chr>        <chr>      
1 Seth    Com Truise   Synthesizer
2 Francis Pixies       Guitar     
3 Bubba   The New Year Guitar     

When we don’t have a one to one match, the result of the different types of join will become more apparent.

band_members
# A tibble: 4 x 2
  Name    Band        
  <chr>   <chr>       
1 Seth    Com Truise  
2 Francis Pixies      
3 Bubba   The New Year
4 Stephen Pavement    
band_instruments
# A tibble: 4 x 2
  Name    Instrument 
  <chr>   <chr>      
1 Seth    Synthesizer
2 Francis Guitar     
3 Bubba   Guitar     
4 Steve   Rage       
left_join(band_members, band_instruments)  
Joining, by = "Name"
# A tibble: 4 x 3
  Name    Band         Instrument 
  <chr>   <chr>        <chr>      
1 Seth    Com Truise   Synthesizer
2 Francis Pixies       Guitar     
3 Bubba   The New Year Guitar     
4 Stephen Pavement     <NA>       
right_join(band_members, band_instruments)
Joining, by = "Name"
# A tibble: 4 x 3
  Name    Band         Instrument 
  <chr>   <chr>        <chr>      
1 Seth    Com Truise   Synthesizer
2 Francis Pixies       Guitar     
3 Bubba   The New Year Guitar     
4 Steve   <NA>         Rage       
inner_join(band_members, band_instruments)
Joining, by = "Name"
# A tibble: 3 x 3
  Name    Band         Instrument 
  <chr>   <chr>        <chr>      
1 Seth    Com Truise   Synthesizer
2 Francis Pixies       Guitar     
3 Bubba   The New Year Guitar     
full_join(band_members, band_instruments)
Joining, by = "Name"
# A tibble: 5 x 3
  Name    Band         Instrument 
  <chr>   <chr>        <chr>      
1 Seth    Com Truise   Synthesizer
2 Francis Pixies       Guitar     
3 Bubba   The New Year Guitar     
4 Stephen Pavement     <NA>       
5 Steve   <NA>         Rage       
anti_join(band_members, band_instruments)
Joining, by = "Name"
# A tibble: 1 x 2
  Name    Band    
  <chr>   <chr>   
1 Stephen Pavement
anti_join(band_instruments, band_members)
Joining, by = "Name"
# A tibble: 1 x 2
  Name  Instrument
  <chr> <chr>     
1 Steve Rage      

Merges can get quite complex, and involve multiple data sources. In many cases you may have to do a lot of processing before getting to the merge, but dplyr’s joins will help quite a bit.

Pivoting axes

The tidyr package can be thought of as a specialized subset of dplyr’s functionality, as well as an update to the previous reshape and reshape2 packages7. Some of its functions for manipulating data you’ll want to be familiar with are:

  • pivot_longer: convert data from a wider format to longer one
  • pivot_wider: convert data from a longer format to wider one
  • unite: paste together multiple columns into one
  • separate: complement of unite
  • unnest: expand ‘list columns’

The following example shows how we take a ‘wide-form’ data set, where multiple columns represent different stock prices, and turn it into two columns, one representing stock name, and one for the price. We need to know which columns to work on, which is the first entry. This function works very much like select, where you can use helpers. Then we need to give a name to the column(s) representing the indicators of what were multiple columns in the wide format. And finally we need to specify the column(s) of the values.

library(tidyr)

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

stocks %>% head
        time           X          Y          Z
1 2009-01-01 -1.23994442 -4.8515935  3.7985281
2 2009-01-02  0.65851483  0.9552487 -2.7255786
3 2009-01-03 -0.91146059 -0.0321312  0.6175274
4 2009-01-04  1.85598621  1.1919978 -2.4837558
5 2009-01-05  0.37266866  0.6297287 -1.1330732
6 2009-01-06 -0.06072664 -2.8673242  1.7155168
stocks %>% 
  pivot_longer(
    cols      = -time,   # works similar to using select()
    names_to  = 'stock', # the name of the column that will have column names as labels
    values_to = 'price'  # the name of the column for the values
  ) %>% 
  head()
# A tibble: 6 x 3
  time       stock  price
  <date>     <chr>  <dbl>
1 2009-01-01 X     -1.24 
2 2009-01-01 Y     -4.85 
3 2009-01-01 Z      3.80 
4 2009-01-02 X      0.659
5 2009-01-02 Y      0.955
6 2009-01-02 Z     -2.73 

Here is a more complex example where we can handle multiple repeated entries. We additionally add another column for labeling, and posit the separator for the column names.

library(tidyr)

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X_1 = rnorm(10, 0, 1),
  X_2 = rnorm(10, 0, 1),
  Y_1 = rnorm(10, 0, 2),
  Y_2 = rnorm(10, 0, 2),
  Z_1 = rnorm(10, 0, 4),
  Z_2 = rnorm(10, 0, 4)
)

head(stocks)
        time        X_1         X_2        Y_1         Y_2        Z_1        Z_2
1 2009-01-01 -0.9675529 -0.72793192  0.7516393  0.03321408  3.7485540  0.3945022
2 2009-01-02 -0.1780449  0.08926355 -0.1976137  1.53569057 -0.0315400  7.6285628
3 2009-01-03  0.2958189  0.38118235  1.6730362 -1.13635638  0.1543268 -5.9254785
4 2009-01-04 -0.7805814 -0.67370673 -0.5696378 -3.62905335 -2.4256959  6.6867209
5 2009-01-05  1.7910958 -0.32353046 -1.6786235 -1.55989831 -4.4294289 -8.1844866
6 2009-01-06  1.1623828 -0.27362716 -0.3116307  2.73462718  0.6675895  1.9884072
stocks %>% 
  pivot_longer(
    cols = -time,
    names_to = c('stock', 'entry'),
    names_sep = '_',
    values_to = 'price'
  ) %>% 
  head()
# A tibble: 6 x 4
  time       stock entry   price
  <date>     <chr> <chr>   <dbl>
1 2009-01-01 X     1     -0.968 
2 2009-01-01 X     2     -0.728 
3 2009-01-01 Y     1      0.752 
4 2009-01-01 Y     2      0.0332
5 2009-01-01 Z     1      3.75  
6 2009-01-01 Z     2      0.395 

Note that the latter is an example of tidy data while the former is not. Why do we generally prefer such data? Precisely because the most common data operations, grouping, filtering, etc., would work notably more efficiently with such data. This is especially the case for visualization.

The following demonstrates the separate function utilized for a very common data processing task- dealing with names. Here’ we’ll separate player into first and last names based on the space.

bball %>% 
  separate(Player, into=c('first_name', 'last_name'), sep=' ') %>% 
  select(1:5) %>% 
  head()
  Rk first_name last_name Pos Age
1  1       Álex   Abrines  SG  25
2  2     Quincy       Acy  PF  28
3  3     Jaylen     Adams  PG  22
4  4     Steven     Adams   C  25
5  5        Bam   Adebayo   C  21
6  6       Deng      Adel  SF  21

Note that this won’t necessarily apply to every name, so further processing may be required.

More Tidyverse

  • dplyr functions: There are over a hundred utility functions that perform very common tasks. You really need to be aware of them, as their use will come up often.

  • broom: Convert statistical analysis objects from R into tidy data frames, so that they can more easily be combined, reshaped and otherwise processed with tools like dplyr, tidyr and ggplot2.

  • tidy*: a lot of packages out there are now ‘tidy’, though not a part of the official tidyverse. Some examples of the ones I’ve used:

    • tidycensus
    • tidybayes
    • tidytext
    • modelr

Seriously, there are a lot.

Personal Opinion

The dplyr grammar is clear for a lot of standard data processing tasks, and some not so common.

Extremely useful for data exploration and visualization.

  • No need to create/overwrite existing objects
  • Can overwrite columns and use as they are created
  • Makes it easy to look at anything, and do otherwise tedious data checks

Drawbacks:

  • Not as fast as data.table or even some base R approaches for many things8
  • The mindset can make for unnecessary complication
    • e.g. There is no need to pipe to create a single new variable
  • Some approaches, are not very intuitive
  • Notably less ability to work with some very common data structures (e.g. matrices)

All in all, if you’ve only been using base R approaches, the tidyverse will change your R life! It makes all the sorts of things you do all the time easier and clearer. Highly recommended!

Tidyverse Exercises

Exercise 0

Install and load the dplyr ggplot2movies packages. Look at the help file for the movies data set, which contains data from IMDB.

install.packages('ggplot2movies')
library(ggplot2movies)
data('movies')

Exercise 1

Using the movies data set, perform each of the following actions separately.

Exercise 1a

Use mutate to create a centered version of the rating variable. A centered variable is one whose mean has been subtracted from it. The process will take the following form:

data %>% 
  mutate(new_var_name = '?')

Exercise 1b

Use filter to create a new data frame that has only movies from the years 2000 and beyond. Use the greater than or equal operator >=.

Exercise 1c

Use select to create a new data frame that only has the title, year, budget, length, rating and votes variables. There are at least 3 ways to do this.

Exercise 1d

Rename the length column to length_in_min (i.e. length in minutes).

Exercise 2

Use group_by to group the data by year, and summarize to create a new variable that is the average budget. The summarize function works just like mutate in this case.

Use the mean function to get the average, but you’ll also need to use the argument na.rm = TRUE within it because the earliest years have no budget recorded.

Exercise 3

Use pivot_longer to create a ‘tidy’ data set from the following.

dat = tibble(id = 1:10,
             x = rnorm(10),
             y = rnorm(10))

Exercise 4

Now put several actions together in one set of piped operations.

  • Filter movies released after 1990
  • select the same variables as before but also the mpaa, Action, and Drama variables
  • group by mpaa and (your choice) Action or Drama
  • get the average rating

It should spit out something like the following:

# A tibble: 10 x 3
# Groups:   mpaa [5]
   mpaa    Drama AvgRating
   <chr>   <int>     <dbl>
 1 ""          0      5.94
 2 ""          1      6.20
 3 "NC-17"     0      4.28
 4 "NC-17"     1      4.62
 5 "PG"        0      5.19
 6 "PG"        1      6.15
 7 "PG-13"     0      5.44
 8 "PG-13"     1      6.14
 9 "R"         0      4.86
10 "R"         1      5.94

Python Pandas Notebook

Available on GitHub


  1. You may be thinking- ‘it’s 2020 and why on earth would anyone do that?!’. Peruse most sports websites and you’ll see that fundamental web design basics escape them. See also, financial sites.↩︎

  2. If you’re following good programming practices, you’ll likely never use slice.↩︎

  3. For more than one function, you could have supplied `funs(function1, function2, …).↩︎

  4. As Hadley Wickham is from New Zealand, and his examples use summarise, you’ll probably see it about as much as you do the other spelling, especially since it will come up first in autocomplete.↩︎

  5. Some still use reshape2 but there is no reason to and it is no longer developed.↩︎

  6. There is multidplyr, dtplyr, and now tidyfast to help get as fast as possible for large data processing.↩︎