data.table

Another package for data processing that has been useful to many is data.table. It works in a notably different way than dplyr, though you'd use it for the same reasons, e.g. subsetting, grouping, updating, ordered joins, etc., but with key advantages in speed and memory efficiency. Like dplyr, its data objects are both data.frames and a package-specific class.
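For instance, a data.table can be created much like a data.frame. The dt object and its values below are a reconstruction based on the output shown throughout this section:

library(data.table)

dt = data.table(
  x = c(1, 6, 8, 9, 3, 4),
  g = c(1, 2, 3, 1, 2, 3),
  y = c(0.9034834, 0.3497804, 0.9798721, 0.4194135, 0.1334348, 0.9944553)
)

class(dt)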

[1] "data.table" "data.frame"

Basics

In general, data.table works with brackets as in base R data frames. However, in order to use data.table effectively you’ll need to forget the data frame similarity. The brackets actually work like a function call, with several key arguments. Consider the following notation to start.
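The general bracket form is:

DT[i, j, by]
# i:  where  (row filter, or a join)
# j:  what   (select / compute / assign)
# by: grouped by what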

Importantly: you don't use the brackets as you would with data.frames. What i and j can be is fairly complex.

In general, you use i for filtering by rows.
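For example, either of the following equivalent calls would produce the filtered result shown below (assuming the dt object created earlier):

dt[x == 6]
dt[x == 6, ]   # the trailing comma is optional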

   x g         y
1: 6 2 0.3497804
   x g         y
1: 6 2 0.3497804

You use j to select (by name!) or create new columns. We can define a new column with the := operator.
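A plausible reconstruction of the calls behind the output below, using the dt from earlier: selecting a column returns a vector, := adds a column by reference, and i, j, and by can be combined in a single call.

dt[, x]                      # select a column; returns a vector
dt[, z := x + y]             # create a new column by reference
dt[, z]
dt[g > 1, mean(z), by = g]   # filter, summarize, and group at once
dt                           # dt now contains z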

[1] 1 6 8 9 3 4
[1] 1.903483 6.349780 8.979872 9.419413 3.133435 4.994455
   g       V1
1: 2 4.741608
2: 3 6.987164
   x g         y        z
1: 1 1 0.9034834 1.903483
2: 6 2 0.3497804 6.349780
3: 8 3 0.9798721 8.979872
4: 9 1 0.4194135 9.419413
5: 3 2 0.1334348 3.133435
6: 4 3 0.9944553 4.994455

Because j is an argument, dropping columns is awkward.
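A bare minus sign in j operates on the column's values rather than dropping the column; to actually drop it you need quotes, or removal by reference. A sketch consistent with the output below:

dt[, -y]           # negates the values of y; does NOT drop the column
dt[, -"y"]         # drops the y column
# dt[, y := NULL]  # or remove it by reference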

[1] -0.9034834 -0.3497804 -0.9798721 -0.4194135 -0.1334348 -0.9944553
   x g        z
1: 1 1 1.903483
2: 6 2 6.349780
3: 8 3 8.979872
4: 9 1 9.419413
5: 3 2 3.133435
6: 4 3 4.994455

data.table does not make unnecessary copies. For example, if we do the following…

DT2 and DT are just names for the same table. You’d actually need to use the copy function to make an explicit copy, otherwise whatever you do to DT2 will be done to DT.
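A sketch of what's going on; the DT object here is reconstructed from the output below:

DT  = data.table(A = 5:1, B = letters[5:1])
DT2 = DT              # no copy; just another name for the same table
DT2[, q := 1]         # modifies DT as well!
DT

DT3 = copy(DT)        # an explicit copy
DT3[, q := NULL]      # leaves DT untouched
DT3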

   A B q
1: 5 e 1
2: 4 d 1
3: 3 c 1
4: 2 b 1
5: 1 a 1
   A B
1: 5 e
2: 4 d
3: 3 c
4: 2 b
5: 1 a

Grouped operations

We can now attempt a 'group-by' operation, along with creation of a new variable. Note that these operations actually modify the dt object in place, a key distinction from dplyr. Fewer copies means less of a memory hit.
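A plausible reconstruction of the grouped operations producing the output below:

dt[, sum(z), by = g]           # grouped summary; returns a new table
dt[, mysum := sum(x), by = g]  # grouped variable, added in place
dt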

   g        V1
1: 1 11.322897
2: 2  9.483215
3: 3 13.974327
   x g         y        z mysum
1: 1 1 0.9034834 1.903483    10
2: 6 2 0.3497804 6.349780     9
3: 8 3 0.9798721 8.979872    12
4: 9 1 0.4194135 9.419413    10
5: 3 2 0.1334348 3.133435     9
6: 4 3 0.9944553 4.994455    12

We can also create groupings on the fly. For a new summary data set, we'll take the following approach: create a grouping based on whether g equals one, then get the mean and sum of x for those two categories. The corresponding dplyr approach is also shown (but not evaluated) for comparison.
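A sketch of both approaches (the dplyr version, as in the text, need not be run):

dt[, .(mean_x = mean(x), sum_x = sum(x)), by = .(g = g == 1)]

# the corresponding dplyr approach:
# dt %>%
#   group_by(g = g == 1) %>%
#   summarise(mean_x = mean(x), sum_x = sum(x))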

       g mean_x sum_x
1:  TRUE   5.00    10
2: FALSE   5.25    21

Faster!

As mentioned, the reason to use data.table is speed. If you have large data or large operations, it'll be useful.

Joins

Joins are not only faster with data.table, but also easy to do. Note that the i argument can itself be a data.table object. I compare its speed to dplyr's left_join function.
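A minimal sketch of such a join; the tables and sizes here are hypothetical, not the ones actually benchmarked:

dt_a = data.table(id = 1:1000, x = rnorm(1000))
dt_b = data.table(id = sample(1:1000, 500), y = rnorm(500))

dt_a[dt_b, on = 'id']   # i is itself a data.table; keeps the rows of dt_b

# roughly the dplyr equivalent:
# left_join(dt_b, dt_a, by = 'id')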

func mean (microseconds)
dt_join 627.20
dplyr_join 924.43

Group by

We can use the setkey function to order a data set by certain column(s). This ordering is done by reference; again, no copy is made. Doing so allows for faster grouped operations, though you'll likely only see the speed gain with very large data. The timings below regard creating a new grouped variable.

test_dt0 = data.table(x = rnorm(10000000),
                      g = sample(letters, 10000000, replace = TRUE))
test_dt1 = copy(test_dt0)        # an unkeyed copy
test_dt2 = setkey(test_dt1, g)   # orders by reference; returns the same object

identical(test_dt0, test_dt1)
[1] FALSE
identical(test_dt1, test_dt2)
[1] TRUE
test_dt0 = test_dt0[, mean := mean(x), by = g]
test_dt1 = test_dt1[, mean := mean(x), by = g]
test_dt2 = test_dt2[, mean := mean(x), by = g]
func mean (milliseconds)
test_dt0 390.93
test_dt1 140.93
test_dt2 142.61

String matching

The %chin% operator works just like %in% for character vectors, returning a logical vector indicating whether each element of its first argument has a match in its second. Its companion function chmatch returns the positions of the (first) matches instead.

Consider the following. We sample the first 14 letters 1000 times with replacement and see which of them fall within the last 14 letters. I compare the same operation to stringr and to the stringi package, whose functionality stringr uses. Both are far slower than %chin%.

lets_1 = sample(letters[1:14], 1000, replace = TRUE)

lets_1 %chin% letters[13:26] %>% head(10)
 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
# stri_detect_regex(lets_1, paste(letters[13:26], collapse='|'))

Reading files

If you use data.table for nothing else, you should still strongly consider it for reading in large text files. Its fread function is fast and quite memory efficient as well. I compare it to readr.
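For example (the file path here is hypothetical):

library(data.table)

big = fread('data/some_large_file.csv')   # fast, memory-efficient read

# readr comparison:
# big = readr::read_csv('data/some_large_file.csv')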

func mean (microseconds)
dt 485.47
readr 3294.09

More speed

The following demonstrates some timings from here. I reproduced them on my own machine with 50 million observations. The grouped operations applied are just a sum and a length on a vector.

By the way, never, ever use aggregate. For anything.

fun elapsed
aggregate 56.86
by 18.12
dplyr 14.45
sapply 12.20
lapply 11.31
tapply 10.57
data.table 0.87

Ever.

Really.

Another thing to note is that the tidy approach is more about clarity and code efficiency relative to base R, as well as doing important background data checks and returning more usable results. In practice, it likely won’t be notably faster except in some cases, like with aggregate.

Pipe with data.table

Piping can be done with data.table objects too, using the brackets, but it’s awkward at best.

Probably better to just use a standard pipe and dot approach if you really need it.
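A sketch of both styles, using the dt object from earlier (the new column w is just for illustration):

# chaining with brackets
dt[, w := x + y][order(-w), .(g, w)]

# a magrittr pipe with the dot
dt %>%
  .[, w := x + y] %>%
  .[order(-w), .(g, w)]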

Summary

Faster and more memory-efficient methods are great to have. If you have large data this is one package that can help.

  • For reading data
  • For grouped operations and joins

Drawbacks:

  • Complex
  • The syntax can be awkward
  • It doesn’t work like a data.frame, which can be confusing
  • Piping with brackets isn’t really feasible, and the dot approach is awkward
  • Does not have its own ‘verse’, though many packages use it

If speed and/or memory is (potentially) a concern, data.table.

For interactive exploration, dplyr.

Piping allows one to use both, so no need to choose.

And on the horizon…

Faster dplyr alternatives

So we have data.table as a starting point for faster data processing operations, but there are others. The dtplyr package implements a data.table back-end for dplyr, so that you can seamlessly use them together. The newer package tidyfast works directly with a data.table object, but uses dplyr-esque functions. The following shows timings for counting unique arrival times in the nycflights13 flights data (336776 rows).

package timing
dplyr 10.58
dtplyr 4.57
data.table 3.52
tidyfast 3.51
a Median time in milliseconds to do a count of arr_time on nycflights13::flights

Just for giggles I did the same in Python with a pandas DataFrame, and it was notably slower than all of these options (almost 10x slower than standard dplyr). A lot of folks who use Python primarily think R is slow, but that is mostly because they don't know how to program effectively with R for data science.

data.table Exercises

Exercise 0

Install and load the data.table package.

Create the following data table.

Exercise 1

Create a new object that contains only the ‘a’ group. Think back to how you use a logical to select rows.

Exercise 2

Create a new object that is the sum of z grouped by x. You don’t need to name the sum variable.