Another package for data processing that has been useful to many is data.table. It works notably differently from dplyr, but you'd use it for the same reasons, e.g. subsetting, grouping, updating, ordered joins, etc., with key advantages in speed and memory efficiency. Like dplyr, the data objects are both data.frames and a package-specific class.
```
[1] "data.table" "data.frame"
```
In general, data.table works with brackets as in base R data frames. However, in order to use data.table effectively you’ll need to forget the data frame similarity. The brackets actually work like a function call, with several key arguments. Consider the following notation to start.
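The code for this notation isn't shown above, but the general form data.table uses can be sketched as follows (the toy data mirrors the x, g, y columns printed later in this section; the y values here are random):

```r
library(data.table)

# general form: DT[i, j, by]
#   i  -- which rows (filter / order)
#   j  -- what to select or compute
#   by -- grouping variable(s)

dt = data.table(x = c(6, 5, 2, 9, 8, 1),
                g = rep(1:3, 2),
                y = runif(6))

dt[g > 1, mean(x), by = g]  # filter rows, compute a grouped mean
```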
Importantly: you don’t use the brackets as you would with data.frames. What i and j can be is fairly complex.
In general, you use i for filtering by rows.
```
   x g         y
1: 5 2 0.1079452
```

```
   x g         y
1: 5 2 0.1079452
```
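The code producing this output isn't shown; a minimal sketch consistent with it (reusing the x, g, y values printed later in the section) might be:

```r
library(data.table)
dt = data.table(x = c(6, 5, 2, 9, 8, 1),
                g = rep(1:3, 2),
                y = c(0.9089802, 0.1079452, 0.8437154,
                      0.7806815, 0.2152209, 0.3346486))

dt[2]    # the second row -- no comma needed, unlike a data.frame
dt[2, ]  # same result
```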
You use j to select columns (by name!) or create new ones. We can define a new column with the := operator.
```
[1] 6 5 2 9 8 1
```

```
[1] 6.908980 5.107945 2.843715 9.780681 8.215221 1.334649
```

```
   g       V1
1: 2 6.661583
2: 3 2.089182
```

```
   x g         y        z
1: 6 1 0.9089802 6.908980
2: 5 2 0.1079452 5.107945
3: 2 3 0.8437154 2.843715
4: 9 1 0.7806815 9.780681
5: 8 2 0.2152209 8.215221
6: 1 3 0.3346486 1.334649
```
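A sketch of the j operations consistent with the output above (the data here reuses the printed x, g, y values):

```r
library(data.table)
dt = data.table(x = c(6, 5, 2, 9, 8, 1),
                g = rep(1:3, 2),
                y = c(0.9089802, 0.1079452, 0.8437154,
                      0.7806815, 0.2152209, 0.3346486))

dt[, x]                     # select a column by (unquoted) name; returns a vector
dt[, z := x + y]            # create a new column by reference
dt[g > 1, mean(z), by = g]  # compute on a subset, grouped
dt                          # dt itself now has the z column
```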
Because j is an argument, dropping columns is awkward.
```
[1] -0.9089802 -0.1079452 -0.8437154 -0.7806815 -0.2152209 -0.3346486
```

```
   x g        z
1: 6 1 6.908980
2: 5 2 5.107945
3: 2 3 2.843715
4: 9 1 9.780681
5: 8 2 8.215221
6: 1 3 1.334649
```
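A sketch of that awkwardness, consistent with the output above:

```r
library(data.table)
dt = data.table(x = c(6, 5, 2, 9, 8, 1),
                g = rep(1:3, 2),
                y = c(0.9089802, 0.1079452, 0.8437154,
                      0.7806815, 0.2152209, 0.3346486))

dt[, -y]         # careful: this just negates the y values!
dt2 = dt[, -"y"] # quoting works: a new table without y (dt unchanged)
dt[, y := NULL]  # or drop y from dt itself, by reference
```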
data.table does not make unnecessary copies. For example, if we do the following…
DT2 and DT are just names for the same table. You’d actually need to use the copy function to make an explicit copy, otherwise whatever you do to DT2 will be done to DT.
```
   A B q
1: 5 e 1
2: 4 d 1
3: 3 c 1
4: 2 b 1
5: 1 a 1
```
```
   A B
1: 5 e
2: 4 d
3: 3 c
4: 2 b
5: 1 a
```
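The code isn't shown, but a sketch consistent with the A/B tables above:

```r
library(data.table)
DT  = data.table(A = 5:1, B = letters[5:1])
DT2 = DT            # not a copy: two names for the same table
DT2[, q := 1]       # so this modifies DT as well

DT3 = copy(DT)      # an explicit, independent copy
DT3[, q := NULL]    # dropping q here leaves DT untouched
DT
DT3
```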
We can now attempt a ‘group-by’ operation, along with creation of a new variable. Note that these operations actually modify the dt object in place, a key distinction from dplyr. Fewer copies means less of a memory hit.
```
   g        V1
1: 1 16.689662
2: 2 13.323166
3: 3  4.178364
```
```
   x g         y        z mysum
1: 6 1 0.9089802 6.908980    15
2: 5 2 0.1079452 5.107945    13
3: 2 3 0.8437154 2.843715     3
4: 9 1 0.7806815 9.780681    15
5: 8 2 0.2152209 8.215221    13
6: 1 3 0.3346486 1.334649     3
```
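A sketch reproducing the grouped operations above:

```r
library(data.table)
dt = data.table(x = c(6, 5, 2, 9, 8, 1),
                g = rep(1:3, 2),
                y = c(0.9089802, 0.1079452, 0.8437154,
                      0.7806815, 0.2152209, 0.3346486))
dt[, z := x + y]

dt[, sum(z), by = g]           # grouped summary: returns a new table
dt[, mysum := sum(x), by = g]  # grouped new column: modifies dt in place
dt
```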
We can also create groupings on the fly. For a new summary data set, we’ll take the following approach: create a grouping based on whether g is a value of one or not, then get the mean and sum of x for those two categories. The corresponding dplyr approach is also shown (but not evaluated) for comparison.
```
       g mean_x sum_x
1:  TRUE    7.5    15
2: FALSE    4.0    16
```
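A sketch matching the result above, with the (unevaluated) dplyr analogue as comments:

```r
library(data.table)
dt = data.table(x = c(6, 5, 2, 9, 8, 1), g = rep(1:3, 2))

# group on the fly by a logical expression
dt[, .(mean_x = mean(x), sum_x = sum(x)), by = g == 1]

# corresponding dplyr approach (not evaluated):
# dt %>%
#   group_by(g == 1) %>%
#   summarise(mean_x = mean(x), sum_x = sum(x))
```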
As mentioned, the reason to use data.table is speed. If you have large data or large operations it’ll be useful.
Joins can not only be faster but also easy to do. Note that the i argument can itself be a data.table object. I compare its speed to dplyr’s comparable left_join function.
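The document's join data isn't shown, so this sketch uses hypothetical tables to illustrate using a data.table as the i argument:

```r
library(data.table)
main   = data.table(id = 1:4, x = c(10, 20, 30, 40))
lookup = data.table(id = c(2, 4), grp = c("a", "b"))

main[lookup, on = "id"]  # i is a data.table: keeps lookup's rows
# equivalent in spirit to dplyr::left_join(lookup, main, by = "id")
```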
We can use the setkey function to order a data set by one or more columns. This ordering is done by reference; again, no copy is made. Doing so allows for faster grouped operations, though you’ll likely only see the speed gain with very large data. The timing here regards creating a new variable.
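A sketch of setkey (the variable names are illustrative, not the document's):

```r
library(data.table)
dt = data.table(x = c(6, 5, 2, 9, 8, 1), g = rep(1:3, 2))

setkey(dt, g)   # sorts dt by g, by reference, and marks g as the key
key(dt)         # "g"

dt[, mean_x := mean(x), by = g]  # grouped operations can now exploit the key
```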
The chmatch function returns a vector of the positions of (first) matches of its first argument in its second, where both arguments are character vectors. Its operator cousin %chin% instead returns a logical vector; essentially it’s just like the %in% function for character vectors, but faster.
Consider the following. We sample the first 14 letters 1000 times with replacement and see which ones match a subset of other letters. I compare the same operation to stringr and to the stringi package, whose functionality stringr uses. Both are far slower than %chin%.
```
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```
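A sketch of the comparison (the seed and the particular comparison subset are assumptions, not the document's):

```r
library(data.table)
set.seed(123)
lets = sample(letters[1:14], 1000, replace = TRUE)

head(lets %chin% letters[13:25], 10)  # logical vector, just like %in%
chmatch(c("a", "q"), lets)            # positions of first matches (NA if absent)
```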
If you use data.table for nothing else, you’d still want to strongly consider it for reading in large text files. Its fread function is very fast and can be quite memory efficient as well. I compare it to readr.
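A self-contained sketch (writing a small temporary csv just so fread has something to read):

```r
library(data.table)
tmp = tempfile(fileext = ".csv")
fwrite(data.table(a = 1:3, b = c("x", "y", "z")), tmp)

d = fread(tmp)  # fast, type-inferring read; returns a data.table
d

# readr analogue (not evaluated):
# readr::read_csv(tmp)
```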
The following demonstrates some timings from here. I reproduced them on my own machine with 50 million observations. The grouped operations applied are just a sum and length on a vector.
By the way, never, ever use aggregate. For anything.
Another thing to note is that the tidy approach is more about clarity and code efficiency relative to base R, as well as doing important background data checks and returning more usable results. In practice, it likely won’t be notably faster except in some cases, like with aggregate.
Pipe with data.table
Piping can be done with data.table objects too, using the brackets, but it’s awkward at best.
Probably better to just use a standard pipe and dot approach if you really need it.
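A sketch of both styles, using magrittr's pipe and dot placeholder:

```r
library(data.table)
library(magrittr)
dt = data.table(x = c(6, 5, 2, 9, 8, 1), g = rep(1:3, 2))

# bracket 'chaining': works, but awkward at best
dt[, sum_x := sum(x), by = g][order(g)]

# standard pipe with the dot placeholder
dt %>%
  .[, mean_x := mean(x), by = g] %>%
  .[order(g)]
```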
Faster and more memory-efficient methods are great to have. If you have large data this is one package that can help.
- For reading data
- Especially for group-by and joins.
- The syntax can be awkward
- It doesn’t work like a data.frame, which can be confusing
- Piping with brackets isn’t really feasible, and the dot approach is awkward
- Does not have its own ‘verse’, though many packages use it
If speed and/or memory is (potentially) a concern, data.table.
For interactive exploration, dplyr.
Piping allows one to use both, so no need to choose.
And on the horizon…
Faster dplyr Alternatives
So we have data.table as a starting point for faster data processing operations, but there are others. The dtplyr package implements a data.table back-end for dplyr, so that you can seamlessly use them together. The newer package tidyfast works directly with a data.table object, but uses dplyr-esque functions. The following shows times for counting unique arrival times in the nycflights13 flights data (336776 rows).
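The flights data itself isn't loaded here; a toy stand-in illustrates the three styles (lazy_dt and dt_count are dtplyr's and tidyfast's actual entry points):

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyfast)

# toy stand-in for nycflights13::flights
d = data.table(arr_time = c(517, 533, 517, 850, 533, 533))

d[, .N, by = arr_time]                              # raw data.table
lazy_dt(d) %>% count(arr_time) %>% as.data.table()  # dtplyr: dplyr verbs, data.table back-end
dt_count(d, arr_time)                               # tidyfast: data.table-first, dplyr-esque
```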
Median time in milliseconds to do a count of arr_time on nycflights13::flights.
Just for giggles I did the same in Python with a pandas DataFrame, and it was notably slower than all of these options (almost 10x slower than standard dplyr). A lot of folks that use Python think R is slow, but that is mostly because they don’t know how to effectively program with R for data science.
Out of memory situations
For very large data sets, especially in cases where distributed data solutions like Spark (and sparklyr) are not viable for practical or security reasons, you may need to try another approach. The disk.frame package does data processing on disk rather than in memory, as is the case with default R approaches. This allows you to process data that may otherwise be too large or too time consuming to work with. For example, it’d be a great option if you are starting out with extremely large data, but your subset of interest is easily manageable within R. With disk.frame, you can do the initial filtering and selection before bringing the data into memory.
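A sketch of that workflow, assuming disk.frame's csv_to_disk.frame entry point (a small temporary csv stands in for an extremely large file):

```r
library(disk.frame)
library(dplyr, warn.conflicts = FALSE)
setup_disk.frame()  # start background workers

tmp = tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10000, g = rep(1:4, 2500)), tmp, row.names = FALSE)

df = csv_to_disk.frame(tmp)  # chunked, on-disk representation
res = df %>%
  filter(g == 1) %>%         # processed chunk by chunk, on disk
  select(x) %>%
  collect()                  # only the subset is brought into memory
nrow(res)
```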
Install and load the data.table package.
Create the following data table.
Create a new object that contains only the ‘a’ group. Think back to how you use a logical to select rows.
Create a new object that is the sum of z grouped by x. You don’t need to name the sum variable.