data.table

Another package for data processing that has been useful to many is data.table. It works in a notably different way than dplyr, but you’d use it for the same reasons, e.g. subsetting, grouping, updating, ordered joins, etc., with key advantages in speed and memory efficiency. As with dplyr, the data objects are both data.frames and a package-specific class.

library(data.table)
dt = data.table(x = sample(1:10, 6),
                g = 1:3,
                y = runif(6))
class(dt)
[1] "data.table" "data.frame"

data.table Basics

In general, data.table works with brackets as in base R data frames. However, in order to use data.table effectively you’ll need to forget the data frame similarity. The brackets actually work like a function call, with several key arguments. Consider the following notation to start.

x[i, j, by, keyby, with = TRUE, ...]

Importantly, you don’t use the brackets as you would with data.frames, and what i and j can be goes well beyond simple row and column indices.
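Before walking through the pieces, here’s a quick sketch combining the three main arguments with the dt created above (your results will differ since the data were randomly generated):

dt[x > 2,                  # i: filter rows
   .(mean_y = mean(y)),    # j: compute on columns
   by = g]                 # by: within groups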

In general, you use i for filtering by rows.

dt[2]    # rows! not columns as with standard data.frame
dt[2,]
   x g         y
1: 5 2 0.1079452
   x g         y
1: 5 2 0.1079452

You use j to select (by name!) or create new columns. We can define a new column with the := operator.

dt[,x]
dt[,z := x+y]  # dt now has a new column
dt[,z]
dt[g > 1, mean(z), by = g]
dt
[1] 6 5 2 9 8 1
[1] 6.908980 5.107945 2.843715 9.780681 8.215221 1.334649
   g       V1
1: 2 6.661583
2: 3 2.089182
   x g         y        z
1: 6 1 0.9089802 6.908980
2: 5 2 0.1079452 5.107945
3: 2 3 0.8437154 2.843715
4: 9 1 0.7806815 9.780681
5: 8 2 0.2152209 8.215221
6: 1 3 0.3346486 1.334649
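The := operator also combines with i, so you can update just a subset of rows by reference. A small sketch (the big column name is purely illustrative):

dt[x > 5, big := TRUE]   # only matching rows are updated; the rest get NA
dt[, big := NULL]        # remove the column again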

Because j is an argument, dropping columns is awkward.

dt[, -y]             # creates negative values of y
dt[, -'y', with = F] # drops y, but now needs quotes
## dt[, y := NULL]      # drops y, but this is just a base R approach
## dt$y = NULL
[1] -0.9089802 -0.1079452 -0.8437154 -0.7806815 -0.2152209 -0.3346486
   x g        z
1: 6 1 6.908980
2: 5 2 5.107945
3: 2 3 2.843715
4: 9 1 9.780681
5: 8 2 8.215221
6: 1 3 1.334649
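Selecting several columns by name works similarly; j takes a list, for which .() is a shorthand. For instance:

dt[, .(x, g)]                     # keep just x and g
dt[, c('x', 'g'), with = FALSE]   # the character-vector version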

data.table does not make unnecessary copies. For example, if we do the following…

DT = data.table(A = 5:1, B = letters[5:1])
DT2 = DT
DT3 = copy(DT)  

DT2 and DT are just names for the same table. You’d actually need to use the copy function to make an explicit copy, otherwise whatever you do to DT2 will be done to DT.

DT2[,q:=1]
DT
   A B q
1: 5 e 1
2: 4 d 1
3: 3 c 1
4: 2 b 1
5: 1 a 1
DT3 
   A B
1: 5 e
2: 4 d
3: 3 c
4: 2 b
5: 1 a
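If you want to verify this yourself, data.table’s address function reports where an object lives in memory (a quick check, not shown in the output above):

address(DT) == address(DT2)   # TRUE: two names, one table
address(DT) == address(DT3)   # FALSE: copy() made a new one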

Grouped Operations

We can now attempt a ‘group-by’ operation, along with creation of a new variable. Note that operations using := modify the dt object in place, a key distinction from dplyr; fewer copies means less of a memory hit. (Also recall from above that dt1 = dt2 = dt below just creates additional names for the same table, not copies.)

dt1 = dt2 = dt
dt[, sum(x, y), by = g]                # sum of all x and y values
   g        V1
1: 1 16.689662
2: 2 13.323166
3: 3  4.178364
dt1[, mysum := sum(x), by = g]         # add new variable to the original data
dt1
   x g         y        z mysum
1: 6 1 0.9089802 6.908980    15
2: 5 2 0.1079452 5.107945    13
3: 2 3 0.8437154 2.843715     3
4: 9 1 0.7806815 9.780681    15
5: 8 2 0.2152209 8.215221    13
6: 1 3 0.3346486 1.334649     3
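data.table also provides special symbols for use in j. For example, .N holds the number of rows in each group, which makes grouped counts trivial (a sketch using the same data):

dt1[, .N, by = g]                            # rows per group
dt1[, .(n = .N, mean_x = mean(x)), by = g]   # combine with other summaries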

We can also create groupings on the fly. For a new summary data set, we’ll take the following approach: create a grouping based on whether g equals one or not, then get the mean and sum of x for those two categories. The corresponding dplyr approach is also shown (but not evaluated) for comparison.

dt2[, list(mean_x = mean(x), sum_x = sum(x)), by = g == 1]
       g mean_x sum_x
1:  TRUE    7.5    15
2: FALSE    4.0    16
## dt2 %>% 
##   group_by(g == 1) %>% 
##   summarise(mean_x = mean(x), sum_x = sum(x))
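A related sketch: keyby works like by but also sorts the result by the grouping values, and the on-the-fly grouping expression can be given a name.

dt2[, .(mean_x = mean(x), sum_x = sum(x)), keyby = .(g_is_1 = g == 1)]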

Faster!

As mentioned, a key reason to use data.table is speed. If you have large data or computationally demanding operations, it will be useful.

Joins

Joins are not only faster, they’re also easy to do. Note that the i argument can be a data.table object itself. I compare its speed to dplyr’s comparable left_join function.

dt1 = setkey(dt1, x)
dt1[dt2]

dt1_df = dt2_df = as.data.frame(dt1)
left_join(dt1_df, dt2_df, by = 'x')
func         mean (microseconds)
dt_join                   504.77
dplyr_join               1588.46
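Note that data.table also allows ad hoc joins without setting a key first, via the on argument, and merge has a fast data.table method as well. A quick sketch:

dt1[dt2, on = 'x']           # join without needing setkey
merge(dt1, dt2, by = 'x')    # data.table's merge method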

Group by

We can use the setkey function to order a data set by one or more columns. This ordering is done by reference; again, no copy is made. Doing so allows for faster grouped operations, though you will likely only see the speed gain with very large data. The timings below are for creating a new grouped-mean variable.

test_dt0 = data.table(x = rnorm(10000000),
                      g = sample(letters, 10000000, replace = T))
test_dt1 = copy(test_dt0)
test_dt2 = setkey(test_dt1, g)

identical(test_dt0, test_dt1)
[1] FALSE
identical(test_dt1, test_dt2)
[1] TRUE
test_dt0 = test_dt0[, mean := mean(x), by = g]
test_dt1 = test_dt1[, mean := mean(x), by = g]
test_dt2 = test_dt2[, mean := mean(x), by = g]
func       mean (milliseconds)
test_dt0                381.29
test_dt1                118.52
test_dt2                109.97
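You can check whether (and how) a data.table is keyed with key and haskey, and remove a key by setting it to NULL:

key(test_dt2)            # "g"
haskey(test_dt0)         # FALSE
setkey(test_dt2, NULL)   # remove the key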

String matching

The %chin% operator is essentially %in% for character vectors, returning a logical vector indicating which elements of its first argument appear in its second; the related chmatch function instead returns the positions of the (first) matches. Both are optimized versions of their base R counterparts.

Consider the following. We sample the first 14 letters 1000 times with replacement and check which ones appear among the last 14 letters. I compare the same operation to stringr and to the stringi package, on which stringr’s functionality is built. Both are far slower than %chin%.

lets_1 = sample(letters[1:14], 1000, replace=T)

lets_1 %chin% letters[13:26] %>% head(10)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# stri_detect_regex(lets_1, paste(letters[13:26], collapse='|'))
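To see the distinction between the two functions mentioned above, a tiny sketch:

chmatch(c('a', 'z'), letters[1:5])   # 1 NA: positions of first matches
c('a', 'z') %chin% letters[1:5]      # TRUE FALSE: logical, like %in%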

Reading files

If you use data.table for nothing else, you’d still want to strongly consider it for reading in large text files. Its fread function is very fast and memory efficient as well. I compare it to readr.

fread('data/cars.csv')
func     mean (microseconds)
dt                    430.91
readr                2900.19
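fread also has handy arguments for very large files, e.g. reading only some columns or rows. A sketch, assuming cars.csv is an export of the base R cars data (columns speed and dist):

fread('data/cars.csv', select = c('speed', 'dist'), nrows = 5)
fwrite(dt, 'data/dt.csv')   # the fast counterpart for writing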

More speed

The following demonstrates some timings from here. I reproduced it on my own machine based on 50 million observations. The grouped operations that are applied are just a sum and length on a vector.

By the way, never, ever use aggregate. For anything.

fun          elapsed
aggregate     56.857
by            18.118
dplyr         14.447
sapply        12.200
lapply        11.309
tapply        10.570
data.table     0.866

Ever.

Really.

Another thing to note is that the tidy approach is more about clarity and code efficiency relative to base R, as well as doing important background data checks and returning more usable results. In practice, it likely won’t be notably faster than base R except in some cases, like with aggregate.

Pipe with data.table

Piping can be done with data.table objects too, using the brackets, but it’s awkward at best.

mydt[, newvar := mean(x), ][, newvar2 := sum(newvar), by = group][, -'y', with = FALSE]
mydt[, newvar := mean(x),
     ][, newvar2 := sum(newvar), by = group
       ][,-'y', with=FALSE]

Probably better to just use a standard pipe and dot approach if you really need it.

mydt[, newvar := mean(x), ] %>%
  .[, newvar2 := sum(newvar), by = group] %>%
  .[, -'y', with = FALSE]

data.table Summary

Faster and more memory-efficient methods are great to have. If you have large data this is one package that can help.

  • For reading data
  • Especially for group-by and joins.

Drawbacks:

  • Complex
  • The syntax can be awkward
  • It doesn’t work like a data.frame, which can be confusing
  • Piping with brackets isn’t really feasible, and the dot approach is awkward
  • Does not have its own ‘verse’, though many packages use it

If speed and/or memory is (potentially) a concern, data.table.

For interactive exploration, dplyr.

Piping allows one to use both, so no need to choose.

And on the horizon…

Faster dplyr Alternatives

So we have data.table as a starting point for faster data processing operations, but there are others. The dtplyr package implements a data.table back end for dplyr, so you can seamlessly use them together. The newer tidyfast package works directly with a data.table object, but uses dplyr-esque functions. The following shows times for counting unique arrival times in the nycflights13 flights data (336776 rows).

package      timing
dplyr        10.580
dtplyr        4.575
data.table    3.519
tidyfast      3.507

Median time in milliseconds to do a count of arr_time on nycflights13::flights.
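For a sense of how dtplyr is used, here is a minimal sketch: wrap the data with lazy_dt, write ordinary dplyr code, and nothing is computed until you ask for a tibble (or data frame).

library(dtplyr)
library(dplyr)

flights_ldt = lazy_dt(nycflights13::flights)   # a lazy data.table

flights_ldt %>% 
  count(arr_time) %>% 
  as_tibble()                                  # computation happens here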

Just for giggles I did the same in Python with a pandas DataFrame, and it was notably slower than all of these options (almost 10x slower than standard dplyr). A lot of folks that use Python think R is slow, but that is mostly because they don’t know how to effectively program with R for data science.

Out of memory situations

For very large data sets, especially in cases where distributed data solutions like Spark (and sparklyr) are not viable for practical or security reasons, you may need to try another approach. The disk.frame package does data processing on disk rather than in memory, which is the default for R. This allows you to process data that may otherwise be too large or too time-consuming to work with. For example, it’d be a great option if you are starting out with extremely large data, but your subset of interest is easily manageable within R; with disk.frame, you can do the initial filtering and selection before bringing the data into memory.
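A minimal sketch of that workflow, with a hypothetical flights.csv that is too big to read comfortably: chunk it onto disk, use dplyr verbs, and only collect the (much smaller) result into memory.

library(disk.frame)
setup_disk.frame()                                    # set up background workers

flights_dfr = csv_to_disk.frame('data/flights.csv')   # hypothetical large file

flights_dfr %>% 
  filter(month == 1) %>% 
  select(year, month, day, arr_time) %>% 
  collect()                                           # bring only this subset into memory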

data.table Exercises

Exercise 0

Install and load the data.table package.

Create the following data table.

mydt = data.table(
  expand.grid(x = 1:3, 
              y = c('a', 'b', 'c')), 
  z = sample(1:20, 9)
)

Exercise 1

Create a new object that contains only the ‘a’ group. Think back to how you use a logical to select rows.

Exercise 2

Create a new object that is the sum of z grouped by x. You don’t need to name the sum variable.