Tidyverse
What is the Tidyverse?
The tidyverse consists of a few key packages:
- ggplot2: data visualization
- tibble: tibbles, a modern re-imagining of data frames
- tidyr: data tidying
- readr: data import
- purrr: functional programming, e.g. alternate approaches to apply
- dplyr: data manipulation
And of course the tidyverse package itself, which will load all of the above in a way that will avoid naming conflicts.
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -------------------------
filter(): dplyr, stats
lag(): dplyr, stats
In addition, there are other packages like lubridate, rvest, stringr and others in the hadleyverse that are also greatly useful.
What is Tidy?
Tidy data refers to data arranged in a way that makes data processing, analysis, and visualization simpler.
In a tidy data set:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Think long before wide.
dplyr
dplyr provides a grammar of data manipulation (like ggplot2 does for visualization). It is the next iteration of plyr, but plyr is deprecated and no longer used. It’s focused on tools for working with data frames, with over 100 functions that might be of specific use to you.
It has three main goals:
- Make the most important data manipulation tasks easier.
- Do them faster.
- Use the same interface to work with data frames, data tables or a database.
Some key operations include:
- select: grab columns
- select helpers: one_of, starts_with, num_range etc.
- filter/slice: grab rows
- group_by: grouped operations
- mutate/transmute: create new variables
- summarize: summarize/aggregate
There are various (SQL-like) join/merge functions:
- inner_join, left_join etc.
And there are a lot of little things like:
- n, n_distinct, nth, n_groups, count, recode, between
In addition, there is no need to quote variable names.
An example
Let’s say we want to select from our data the following variables:
- Start with the ID variable
- The variables X1 through X10, which are not all grouped together, and there are many more X* columns
- The variables var1 and var2, which are the only variables with var in their name
- Any variable with a name that starts with XYZ
How might we go about this in a dataset of possibly hundreds or even thousands of columns? There are several base R approaches that we could go with, but often they will be tedious, or require multiple objects to be created just to get the columns you want. Let’s start with the worst choice.
Using numeric indexes, or rather magic numbers, is not conducive to readability or reproducibility. If anything changes about the data columns, the numbers may no longer be applicable, and you’d have to redo the line again.
We could name the variables explicitly.
This would be fine if there are only a handful. But if you’re trying to reduce a 1000 column data set to several dozen it’s tedious, and generally not pretty regardless.
A more advanced alternative regards a two-step approach with regular expressions. This requires that you know something about regex (and you should), but it is difficult to read/understand by those who don’t, and often by even yourself if it’s more complicated. In any case, you first will need to create an object that represents the column names first, otherwise it looks unwieldy if used within brackets or a function like subset.
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep(colnames(oldData), '^XYZ', value=T))
newData = oldData[,cols]
# or via subset
newData = subset(oldData, select = cols)
Now consider there is even more to do. What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)? Probably the more straightforward way in R to do so would be something like the following, where each part is broken out and we continuously write over the object as we modify it.
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]
And this is for fairly straightforward operations.
Now consider doing all of the previous in one piped operation. The dplyr package will allow us to do something like the following.
newData = oldData %>%
select(num_range('X', 1:10),
contains('var'),
starts_with('XYZ')) %>%
filter(Z == 'Yes',
Q == 'No') %>%
top_n(n=50, var2) %>%
arrange(desc(var1))
Even if it hadn’t been explained before, you might have been able to guess a little as to what was going on. The code is fairly succinct, we don’t have to keep referencing objects repeatedly, and no explicit intermediary objects are created.
dplyr and piping is an alternative. You can do all this sort of stuff with base R, for example, with functions like with, within, subset, transform, etc. Though the initial base R approach depicted is fairly concise, in general, it can potentially be:
- more verbose
- less legible
- less amenable to additional data changes
- requires esoteric knowledge (e.g. regular expressions)
- often requires creation of new objects (even if we just want to explore)
- often slower, possibly greatly
Running Example
The following data was scraped initially scraped from the web as follows. It is data from the NBA basketball league for the last season with things like player names, position, team name, points per game, field goal percentage, and various other statistics. We’ll use it as an example to demonstrate various functionality found within dplyr.
library(rvest)
current_year = lubridate::year(Sys.Date())
url = glue::glue("http://www.basketball-reference.com/leagues/NBA_{current_year-1}_totals.html")
bball = read_html(url) %>%
html_nodes("#totals_stats") %>%
html_table() %>%
data.frame()
save(bball, file='data/bball.RData')
However you can just load it into your workspace as below. Note that when initially gathered from the website, the data is all character strings. We’ll fix this later. The following shows the data as it will eventually be.
Rows: 734
Columns: 5
$ Rk <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "16", "16", "17", "18", "19", "20", "Rk", "21", "22", "23", "23", "23", "24", "25", "26", "27", "28", "28", "28", "…
$ Player <chr> "Álex Abrines", "Quincy Acy", "Jaylen Adams", "Steven Adams", "Bam Adebayo", "Deng Adel", "DeVaughn Akoon-Purcell", "LaMarcus Aldridge", "Rawle Alkins", "Grayson Allen", "Jarrett Allen", "Kadeem Allen",…
$ Pos <chr> "SG", "PF", "PG", "C", "C", "SF", "SG", "C", "SG", "SG", "C", "SG", "PF", "SF", "SF", "PF", "PF", "PF", "C", "PF", "PF", "PF", "Pos", "SF", "PG", "SF", "SF", "SF", "PG", "C", "SG", "PF", "SG", "SG", "SG…
$ Age <chr> "25", "28", "22", "25", "21", "21", "25", "33", "21", "23", "20", "26", "28", "25", "25", "30", "30", "30", "20", "24", "21", "34", "Age", "21", "24", "33", "33", "33", "31", "20", "23", "19", "25", "25…
$ Tm <chr> "OKC", "PHO", "ATL", "OKC", "MIA", "CLE", "DEN", "SAS", "CHI", "UTA", "BRK", "NYK", "POR", "ATL", "MEM", "TOT", "PHO", "MIA", "IND", "MIL", "DAL", "HOU", "Tm", "TOR", "CHI", "TOT", "PHO", "WAS", "ORL", …
Selecting Columns
Often you do not need the entire data set. While this is easily handled in base R (as shown earlier), it can be more clear to use select in dplyr. Now we won’t have to create separate objects, use quotes or $, etc.
Player Tm Pos
1 Álex Abrines OKC SG
2 Quincy Acy PHO PF
3 Jaylen Adams ATL PG
4 Steven Adams OKC C
5 Bam Adebayo MIA C
6 Deng Adel CLE SF
What if we want to drop some variables?
Rk Age G GS MP FG FGA FG. X3P X3PA X3P. X2P X2PA X2P. eFG. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS
1 1 25 31 2 588 56 157 .357 41 127 .323 15 30 .500 .487 12 13 .923 5 43 48 20 17 6 14 53 165
2 2 28 10 0 123 4 18 .222 2 15 .133 2 3 .667 .278 7 10 .700 3 22 25 8 1 4 4 24 17
3 3 22 34 1 428 38 110 .345 25 74 .338 13 36 .361 .459 7 9 .778 11 49 60 65 14 5 28 45 108
4 4 25 80 80 2669 481 809 .595 0 2 .000 481 807 .596 .595 146 292 .500 391 369 760 124 117 76 135 204 1108
5 5 21 82 28 1913 280 486 .576 3 15 .200 277 471 .588 .579 166 226 .735 165 432 597 184 71 65 121 203 729
6 6 21 19 3 194 11 36 .306 6 23 .261 5 13 .385 .389 4 4 1.000 3 16 19 5 1 4 6 13 32
Helper functions
Sometimes, we have a lot of variables to select, and if they have a common naming scheme, this can be very easy.
Player X3P X3PA X3P. ORB DRB TRB
1 Player 3P 3PA 3P% ORB DRB TRB
2 Player 3P 3PA 3P% ORB DRB TRB
3 Player 3P 3PA 3P% ORB DRB TRB
4 Player 3P 3PA 3P% ORB DRB TRB
5 Player 3P 3PA 3P% ORB DRB TRB
6 Player 3P 3PA 3P% ORB DRB TRB
The select also has helper functions to make selecting columns even easier. I probably don’t even need to explain what’s being done above, and this is the power of the tidyverse way. Here is the list of helper functions to be aware of:
- starts_with: starts with a prefix
- ends_with: ends with a suffix
- contains: contains a literal string
- matches: matches a regular expression
- num_range: a numerical range like x01, x02, x03.
- one_of: variables in character vector.
- everything: all variables.
Filtering Rows
There are repeated header rows in this data3, so we need to drop them. This is also why everything was character string when we first scraped it, because having any character strings in a column coerces the entire column to be character, since all elements of a vector need to be of the same type. Character string is chosen over others because anything can be converted to a string, but not everything can be a number.
Filtering by rows requires the basic indexing knowledge we talked about before, especially Boolean indexing. In the following, Rk
, or rank, is for all intents and purposes just a row id, but if it equals the actual text ‘Rk’ instead of something else, we know we’re dealing with a header row, so we’ll drop it.
- filter returns rows with matching conditions.
- slice allows for a numeric indexing approach4.
Say we want to look at forwards (SF or PF) over the age of 35. The following will do this, and since some players play on multiple teams, we’ll want only the unique information on the variables of interest. The function distinct allows us to do this.
Player Pos Age
1 Vince Carter PF 42
2 Kyle Korver PF 37
3 Dirk Nowitzki PF 40
Maybe we want just the first 10 rows. This is often the case when we perform some operation and need to quickly verify that what we’re doing is working in principle.
Rk Player Pos Age Tm G GS MP FG FGA FG. X3P X3PA X3P. X2P X2PA X2P. eFG. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS
1 1 Álex Abrines SG 25 OKC 31 2 588 56 157 .357 41 127 .323 15 30 .500 .487 12 13 .923 5 43 48 20 17 6 14 53 165
2 2 Quincy Acy PF 28 PHO 10 0 123 4 18 .222 2 15 .133 2 3 .667 .278 7 10 .700 3 22 25 8 1 4 4 24 17
3 3 Jaylen Adams PG 22 ATL 34 1 428 38 110 .345 25 74 .338 13 36 .361 .459 7 9 .778 11 49 60 65 14 5 28 45 108
4 4 Steven Adams C 25 OKC 80 80 2669 481 809 .595 0 2 .000 481 807 .596 .595 146 292 .500 391 369 760 124 117 76 135 204 1108
5 5 Bam Adebayo C 21 MIA 82 28 1913 280 486 .576 3 15 .200 277 471 .588 .579 166 226 .735 165 432 597 184 71 65 121 203 729
6 6 Deng Adel SF 21 CLE 19 3 194 11 36 .306 6 23 .261 5 13 .385 .389 4 4 1.000 3 16 19 5 1 4 6 13 32
7 7 DeVaughn Akoon-Purcell SG 25 DEN 7 0 22 3 10 .300 0 4 .000 3 6 .500 .300 1 2 .500 1 3 4 6 2 0 2 4 7
8 8 LaMarcus Aldridge C 33 SAS 81 81 2687 684 1319 .519 10 42 .238 674 1277 .528 .522 349 412 .847 251 493 744 194 43 107 144 179 1727
9 9 Rawle Alkins SG 21 CHI 10 1 120 13 39 .333 3 12 .250 10 27 .370 .372 8 12 .667 11 15 26 13 1 0 8 7 37
10 10 Grayson Allen SG 23 UTA 38 2 416 67 178 .376 32 99 .323 35 79 .443 .466 45 60 .750 3 20 23 25 6 6 33 47 211
We can use filtering even with variables just created.
bball %>%
unite("posTeam", Pos, Tm) %>% # create a new variable
filter(posTeam == "SG_GSW") %>% # use it for filtering
select(Player, posTeam, Age) %>% # use it for selection
arrange(desc(Age)) # descending order
Player posTeam Age
1 Klay Thompson SG_GSW 28
2 Damion Lee SG_GSW 26
3 Jacob Evans SG_GSW 21
Being able to use a newly created variable on the fly, possibly only to filter or create some other variable, goes a long way toward easy visualization and generation of desired summary statistics.
Generating New Data
One of the most common data processing tasks is generating new variables. The function mutate takes a vector and returns one of the same dimension. In addition, there is mutate_at, mutate_if, and mutate_all to help with specific scenarios.
To demonstrate, we’ll use mutate_at to make appropriate columns numeric, i.e. everything except Player
, Pos
, and Tm
. It takes two inputs, variables and functions to apply. As there are multiple variables and (potentially) multiple functions, we use the vars and funs functions to denote them5.
Rows: 708
Columns: 7
$ Rk <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 16, 16, 17, 18, 19, 20, 21, 22, 23, 23, 23, 24, 25, 26, 27, 28, 28, 28, 29, 30, 31, 32, 33, 33, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,…
$ Player <chr> "Álex Abrines", "Quincy Acy", "Jaylen Adams", "Steven Adams", "Bam Adebayo", "Deng Adel", "DeVaughn Akoon-Purcell", "LaMarcus Aldridge", "Rawle Alkins", "Grayson Allen", "Jarrett Allen", "Kadeem Allen",…
$ Pos <chr> "SG", "PF", "PG", "C", "C", "SF", "SG", "C", "SG", "SG", "C", "SG", "PF", "SF", "SF", "PF", "PF", "PF", "C", "PF", "PF", "PF", "SF", "PG", "SF", "SF", "SF", "PG", "C", "SG", "PF", "SG", "SG", "SG", "PG"…
$ Age <dbl> 25, 28, 22, 25, 21, 21, 25, 33, 21, 23, 20, 26, 28, 25, 25, 30, 30, 30, 20, 24, 21, 34, 21, 24, 33, 33, 33, 31, 20, 23, 19, 25, 25, 25, 22, 21, 20, 34, 26, 26, 26, 28, 23, 30, 30, 32, 29, 25, 22, 30, 32…
$ Tm <chr> "OKC", "PHO", "ATL", "OKC", "MIA", "CLE", "DEN", "SAS", "CHI", "UTA", "BRK", "NYK", "POR", "ATL", "MEM", "TOT", "PHO", "MIA", "IND", "MIL", "DAL", "HOU", "TOR", "CHI", "TOT", "PHO", "WAS", "ORL", "PHO",…
$ G <dbl> 31, 10, 34, 80, 82, 19, 7, 81, 10, 38, 80, 19, 81, 48, 43, 25, 15, 10, 3, 72, 2, 10, 67, 81, 69, 26, 43, 81, 71, 43, 62, 15, 11, 4, 16, 47, 47, 38, 77, 49, 28, 43, 30, 75, 34, 51, 67, 82, 81, 26, 79, 68…
$ GS <dbl> 2, 0, 1, 80, 28, 3, 0, 81, 1, 2, 80, 1, 81, 4, 40, 8, 8, 0, 0, 72, 0, 2, 6, 32, 69, 26, 43, 81, 70, 13, 4, 0, 0, 0, 0, 45, 1, 0, 77, 49, 28, 38, 3, 72, 6, 18, 35, 82, 18, 2, 1, 3, 15, 27, 0, 12, 49, 1, …
Now that the data columns are of the correct type, the following demonstrates how we can use the standard mutate function to create composites of existing variables.
bball = bball %>%
mutate(
trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
effectiveFG = (FG + (.5 * X3P)) / FGA,
shootingDif = trueShooting - FG.
)
summary(select(bball, shootingDif)) # select and others don't have to be piped to use
shootingDif
Min. :-0.08561
1st Qu.: 0.06722
Median : 0.09829
Mean : 0.09420
3rd Qu.: 0.12379
Max. : 0.53192
NA's :6
Grouping and Summarizing Data
Another very common task is to look at group-based statistics, and we can use group_by and summarize to help us in this regard6. Base R has things like aggregate, by, and tapply for this, but they should not be used, as this approach is much more straightforward, flexible, and faster.
Conceptually we are doing a three-phase task: split, apply, combine. We split the data into subsets, apply a function, and then combine the results back into a single output. In applying a function, we may do any of the previously demonstrated tasks: calculate some statistic, generate new data, or even filter to a reduced part of the data.
For this demonstration, I’m going to start putting together several things we’ve demonstrated thus far. Ultimately we’ll create a variable called trueShooting, which represents ‘true shooting percentage’, and get an average for each position, and compare it to the average field goal percentage.
bball %>%
select(Pos, FG, FGA, FG., FTA, X3P, PTS) %>%
mutate(
trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
effectiveFG = (FG + (.5 * X3P)) / FGA,
shootingDif = trueShooting - FG.
) %>%
group_by(Pos) %>%
summarize(
`Mean FG%` = mean(FG., na.rm = TRUE),
`Mean True Shooting` = mean(trueShooting, na.rm = TRUE)
)
# A tibble: 11 x 3
Pos `Mean FG%` `Mean True Shooting`
<chr> <dbl> <dbl>
1 C 0.522 0.572
2 C-PF 0.407 0.530
3 PF 0.442 0.536
4 PF-C 0.356 0.492
5 PF-SF 0.419 0.544
6 PG 0.409 0.512
7 SF 0.425 0.529
8 SF-SG 0.431 0.558
9 SG 0.407 0.517
10 SG-PF 0.416 0.582
11 SG-SF 0.38 0.466
We can do even more with grouped data. Specifically, we can create a new list-column in the data, the elements of which can be anything, even the results of an analysis for each group. As such, we can use tidyr’s unnest to get back to a standard data frame.
To demonstrate, the following will group data by position, then get the correlation between field-goal percentage and free-throw shooting percentage. Some players are listed with multiple positions, so we will reduce those to whatever their first position is using case_when.
bball %>%
mutate(
Pos = case_when(
Pos == 'PG-SG' ~ 'PG',
Pos == 'C-PF' ~ 'C',
Pos == 'SF-SG' ~ 'SF',
Pos == 'PF-C' | Pos == 'PF-SF' ~ 'PF',
Pos == 'SG-PF' | Pos == 'SG-SF' ~ 'SG',
TRUE ~ Pos
)) %>%
nest_by(Pos) %>%
mutate(FgFt_Corr = list(cor(data$FG., data$FT., use = 'complete'))) %>%
unnest(c(Pos, FgFt_Corr))
# A tibble: 5 x 3
# Groups: Pos [5]
Pos data FgFt_Corr
<chr> <list<tbl_df[,32]>> <dbl>
1 C [121 × 32] -0.122
2 PF [150 × 32] -0.0186
3 PG [139 × 32] 0.0857
4 SF [120 × 32] 0.00422
5 SG [178 × 32] -0.0585
As a reminder, data frames are lists. As such, anything can go into the ‘columns’, even regression models!
library(nycflights13)
carriers = group_by(flights, carrier)
group_size(carriers) # if you're curious, there is a function to quickly get group Ns
[1] 18460 32729 714 54635 48110 54173 685 3260 342 26397 32 58665 20536 5162 12275 601
mods = flights %>%
nest_by(carrier) %>%
mutate(model = list(lm(arr_delay ~ dep_time, data = data)) )
mods
# A tibble: 16 x 3
# Rowwise: carrier
carrier data model
<chr> <list<tbl_df[,18]>> <list>
1 9E [18,460 × 18] <lm>
2 AA [32,729 × 18] <lm>
3 AS [714 × 18] <lm>
4 B6 [54,635 × 18] <lm>
5 DL [48,110 × 18] <lm>
6 EV [54,173 × 18] <lm>
7 F9 [685 × 18] <lm>
8 FL [3,260 × 18] <lm>
9 HA [342 × 18] <lm>
10 MQ [26,397 × 18] <lm>
11 OO [32 × 18] <lm>
12 UA [58,665 × 18] <lm>
13 US [20,536 × 18] <lm>
14 VX [5,162 × 18] <lm>
15 WN [12,275 × 18] <lm>
16 YV [601 × 18] <lm>
mods %>%
summarize(
carrier = carrier,
`Adjusted Rsq` = summary(model)$adj.r.squared,
coef_dep_time = coef(model)[2]
)
# A tibble: 16 x 3
# Groups: carrier [16]
carrier `Adjusted Rsq` coef_dep_time
<chr> <dbl> <dbl>
1 9E 0.0513 0.0252
2 AA 0.0504 0.0209
3 AS 0.0815 0.0186
4 B6 0.0241 0.0120
5 DL 0.0347 0.0179
6 EV 0.0836 0.0290
7 F9 0.0998 0.0484
8 FL 0.0261 0.0183
9 HA -0.00124 -0.0578
10 MQ 0.0499 0.0218
11 OO -0.0189 0.0394
12 UA 0.0673 0.0220
13 US 0.0575 0.0174
14 VX 0.111 0.0362
15 WN 0.119 0.0345
16 YV 0.137 0.0805
You can use group_by on more than one variable, e.g. group_by(var1, var2)
Renaming Columns
Tibbles in the tidyverse don’t really have a problem with variable names starting with numbers or incorporating symbols and spaces. I would still suggest it is poor practice, because even if your data set looks fine, you’ll possibly encounter problems with modeling and visualization packages using that data. However, as a demonstration, we can ‘fix’ some of the variable names.
One issue is that when we scraped the data and converted it to a data.frame, the names that started with a number, like 3P
for ‘three point baskets made’, were made into X3P
, because that’s the way R works by default. In addition, 3P%
, i.e. three point percentage made, was made into 3P.
with a dot for the percent sign. Same goes for the 2P (two-pointers) and FT (free-throw) variables.
We can use rename to change column names. A basic example is as follows.
Very straightforward. However, oftentimes we’ll need to change patterns, as with our current problem. The following uses str_replace and str_remove from stringr to look for a pattern in a name, and replace that pattern with some other pattern. It uses regular expressions for the patterns.
bball = bball %>%
rename_with(
str_replace, # function
contains('.'), # columns
pattern = '\\.', # function arguments
replacement = '%'
) %>%
rename_with(str_remove, starts_with('X'), pattern = 'X')
colnames(bball)
[1] "Rk" "Player" "Pos" "Age" "Tm" "G" "GS" "MP" "FG" "FGA" "FG%" "3P" "3PA" "3P%"
[15] "2P" "2PA" "2P%" "eFG%" "FT" "FTA" "FT%" "ORB" "DRB" "TRB" "AST" "STL" "BLK" "TOV"
[29] "PF" "PTS" "trueShooting" "effectiveFG" "shootingDif"
Merging Data
Merging data is yet another very common data task, as data often comes from multiple sources. In order to do this, we need some common identifier among the sources by which to join them. The following is a list of dplyr join functions.
inner_join: return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
left_join: return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join: return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
semi_join: return all rows from x where there are matching values in y, keeping just columns from x. It differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
anti_join: return all rows from x where there are not matching values in y, keeping just columns from x.
full_join: return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
Probably the most common is a left join, where we have one primary data set, and are adding data from another source to it while retaining it as a base. The following is a simple demonstration.
# A tibble: 3 x 2
Name Band
<chr> <chr>
1 Seth Com Truise
2 Francis Pixies
3 Bubba The New Year
# A tibble: 3 x 2
Name Instrument
<chr> <chr>
1 Francis Guitar
2 Bubba Guitar
3 Seth Synthesizer
Joining, by = "Name"
# A tibble: 3 x 3
Name Band Instrument
<chr> <chr> <chr>
1 Seth Com Truise Synthesizer
2 Francis Pixies Guitar
3 Bubba The New Year Guitar
When we don’t have a one to one match, the result of the different types of join will become more apparent.
# A tibble: 4 x 2
Name Band
<chr> <chr>
1 Seth Com Truise
2 Francis Pixies
3 Bubba The New Year
4 Stephen Pavement
# A tibble: 4 x 2
Name Instrument
<chr> <chr>
1 Seth Synthesizer
2 Francis Guitar
3 Bubba Guitar
4 Steve Rage
Joining, by = "Name"
# A tibble: 4 x 3
Name Band Instrument
<chr> <chr> <chr>
1 Seth Com Truise Synthesizer
2 Francis Pixies Guitar
3 Bubba The New Year Guitar
4 Stephen Pavement <NA>
Joining, by = "Name"
# A tibble: 4 x 3
Name Band Instrument
<chr> <chr> <chr>
1 Seth Com Truise Synthesizer
2 Francis Pixies Guitar
3 Bubba The New Year Guitar
4 Steve <NA> Rage
Joining, by = "Name"
# A tibble: 3 x 3
Name Band Instrument
<chr> <chr> <chr>
1 Seth Com Truise Synthesizer
2 Francis Pixies Guitar
3 Bubba The New Year Guitar
Joining, by = "Name"
# A tibble: 5 x 3
Name Band Instrument
<chr> <chr> <chr>
1 Seth Com Truise Synthesizer
2 Francis Pixies Guitar
3 Bubba The New Year Guitar
4 Stephen Pavement <NA>
5 Steve <NA> Rage
Joining, by = "Name"
# A tibble: 1 x 2
Name Band
<chr> <chr>
1 Stephen Pavement
Joining, by = "Name"
# A tibble: 1 x 2
Name Instrument
<chr> <chr>
1 Steve Rage
Merges can get quite complex, and involve multiple data sources. In many cases you may have to do a lot of processing before getting to the merge, but dplyr’s joins will help quite a bit.
Pivoting axes
The tidyr package can be thought of as a specialized subset of dplyr’s functionality, as well as an update to the previous reshape and reshape2 packages7. Some of its functions for manipulating data you’ll want to be familiar with are:
- pivot_longer: convert data from a wider format to longer one
- pivot_wider: convert data from a longer format to wider one
- unite: paste together multiple columns into one
- separate: complement of unite
- unnest: expand ‘list columns’
The following example shows how we take a ‘wide-form’ data set, where multiple columns represent different stock prices, and turn it into two columns, one representing stock name, and one for the price. We need to know which columns to work on, which is the first entry. This function works very much like select, where you can use helpers. Then we need to give a name to the column(s) representing the indicators of what were multiple columns in the wide format. And finally we need to specify the column(s) of the values.
library(tidyr)
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks %>% head
time X Y Z
1 2009-01-01 -1.23994442 -4.8515935 3.7985281
2 2009-01-02 0.65851483 0.9552487 -2.7255786
3 2009-01-03 -0.91146059 -0.0321312 0.6175274
4 2009-01-04 1.85598621 1.1919978 -2.4837558
5 2009-01-05 0.37266866 0.6297287 -1.1330732
6 2009-01-06 -0.06072664 -2.8673242 1.7155168
stocks %>%
pivot_longer(
cols = -time, # works similar to using select()
names_to = 'stock', # the name of the column that will have column names as labels
values_to = 'price' # the name of the column for the values
) %>%
head()
# A tibble: 6 x 3
time stock price
<date> <chr> <dbl>
1 2009-01-01 X -1.24
2 2009-01-01 Y -4.85
3 2009-01-01 Z 3.80
4 2009-01-02 X 0.659
5 2009-01-02 Y 0.955
6 2009-01-02 Z -2.73
Here is a more complex example where we can handle multiple repeated entries. We additionally add another column for labeling, and posit the separator for the column names.
library(tidyr)
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X_1 = rnorm(10, 0, 1),
X_2 = rnorm(10, 0, 1),
Y_1 = rnorm(10, 0, 2),
Y_2 = rnorm(10, 0, 2),
Z_1 = rnorm(10, 0, 4),
Z_2 = rnorm(10, 0, 4)
)
head(stocks)
time X_1 X_2 Y_1 Y_2 Z_1 Z_2
1 2009-01-01 -0.9675529 -0.72793192 0.7516393 0.03321408 3.7485540 0.3945022
2 2009-01-02 -0.1780449 0.08926355 -0.1976137 1.53569057 -0.0315400 7.6285628
3 2009-01-03 0.2958189 0.38118235 1.6730362 -1.13635638 0.1543268 -5.9254785
4 2009-01-04 -0.7805814 -0.67370673 -0.5696378 -3.62905335 -2.4256959 6.6867209
5 2009-01-05 1.7910958 -0.32353046 -1.6786235 -1.55989831 -4.4294289 -8.1844866
6 2009-01-06 1.1623828 -0.27362716 -0.3116307 2.73462718 0.6675895 1.9884072
stocks %>%
pivot_longer(
cols = -time,
names_to = c('stock', 'entry'),
names_sep = '_',
values_to = 'price'
) %>%
head()
# A tibble: 6 x 4
time stock entry price
<date> <chr> <chr> <dbl>
1 2009-01-01 X 1 -0.968
2 2009-01-01 X 2 -0.728
3 2009-01-01 Y 1 0.752
4 2009-01-01 Y 2 0.0332
5 2009-01-01 Z 1 3.75
6 2009-01-01 Z 2 0.395
Note that the latter is an example of tidy data while the former is not. Why do we generally prefer such data? Precisely because the most common data operations, grouping, filtering, etc., would work notably more efficiently with such data. This is especially the case for visualization.
The following demonstrates the separate function utilized for a very common data processing task- dealing with names. Here’ we’ll separate player into first and last names based on the space.
Rk first_name last_name Pos Age
1 1 Álex Abrines SG 25
2 2 Quincy Acy PF 28
3 3 Jaylen Adams PG 22
4 4 Steven Adams C 25
5 5 Bam Adebayo C 21
6 6 Deng Adel SF 21
Note that this won’t necessarily apply to every name, so further processing may be required.
More Tidyverse
dplyr functions: There are over a hundred utility functions that perform very common tasks. You really need to be aware of them, as their use will come up often.
broom: Convert statistical analysis objects from R into tidy data frames, so that they can more easily be combined, reshaped and otherwise processed with tools like dplyr, tidyr and ggplot2.
tidy*: a lot of packages out there are now ‘tidy’, though not a part of the official tidyverse. Some examples of the ones I’ve used:
- tidycensus
- tidybayes
- tidytext
- modelr
Seriously, there are a lot.
Personal Opinion
The dplyr grammar is clear for a lot of standard data processing tasks, and some not so common.
Extremely useful for data exploration and visualization.
- No need to create/overwrite existing objects
- Can overwrite columns and use as they are created
- Makes it easy to look at anything, and do otherwise tedious data checks
Drawbacks:
- Not as fast as data.table or even some base R approaches for many things8
- The mindset can make for unnecessary complication
- e.g. There is no need to pipe to create a single new variable
- Some approaches, are not very intuitive
- Notably less ability to work with some very common data structures (e.g. matrices)
All in all, if you’ve only been using base R approaches, the tidyverse will change your R life! It makes all the sorts of things you do all the time easier and clearer. Highly recommended!
Tidyverse Exercises
Exercise 0
Install and load the dplyr ggplot2movies packages. Look at the help file for the movies
data set, which contains data from IMDB.
Exercise 1
Using the movies data set, perform each of the following actions separately.
Exercise 1a
Use mutate to create a centered version of the rating variable. A centered variable is one whose mean has been subtracted from it. The process will take the following form:
Exercise 1b
Use filter to create a new data frame that has only movies from the years 2000 and beyond. Use the greater than or equal operator >=
.
Exercise 1c
Use select to create a new data frame that only has the title
, year
, budget
, length
, rating
and votes
variables. There are at least 3 ways to do this.
Exercise 1d
Rename the length
column to length_in_min
(i.e. length in minutes).
Exercise 2
Use group_by to group the data by year, and summarize to create a new variable that is the average budget. The summarize function works just like mutate in this case.
Use the mean function to get the average, but you’ll also need to use the argument na.rm = TRUE
within it because the earliest years have no budget recorded.
Exercise 3
Use pivot_longer to create a ‘tidy’ data set from the following.
Exercise 4
Now put several actions together in one set of piped operations.
- Filter movies released after 1990
- select the same variables as before but also the
mpaa
,Action
, andDrama
variables - group by
mpaa
and (your choice)Action
orDrama
- get the average rating
It should spit out something like the following:
# A tibble: 10 x 3
# Groups: mpaa [5]
mpaa Drama AvgRating
<chr> <int> <dbl>
1 "" 0 5.94
2 "" 1 6.20
3 "NC-17" 0 4.28
4 "NC-17" 1 4.62
5 "PG" 0 5.19
6 "PG" 1 6.15
7 "PG-13" 0 5.44
8 "PG-13" 1 6.14
9 "R" 0 4.86
10 "R" 1 5.94
Python Pandas Notebook
You may be thinking- ‘it’s 2020 and why on earth would anyone do that?!’. Peruse most sports websites and you’ll see that fundamental web design basics escape them. See also, financial sites.↩︎
If you’re following good programming practices, you’ll likely never use slice.↩︎
For more than one function, you could have supplied `funs(function1, function2, …).↩︎
As Hadley Wickham is from New Zealand, and his examples use summarise, you’ll probably see it about as much as you do the other spelling, especially since it will come up first in autocomplete.↩︎
Some still use reshape2 but there is no reason to and it is no longer developed.↩︎
There is multidplyr, dtplyr, and now tidyfast to help get as fast as possible for large data processing.↩︎