library(dplyr)
library(tidyr)
library(dtplyr)
library(tidyfast)
library(data.table)
library(collapse)
library(vctrs)
library(bench) # for timings
NB: This post was revisited when updating the website early 2025, and some changes were required. Attempts to keep things consistent were made, but if you feel you’ve found an issue, please post it at GitHub.
Introduction
Oftentimes I’m looking to gain speed/memory advantages, or maybe just exploring how to do the same thing differently in case it becomes useful later on. I keep meaning to post these explorations, but never get around to it. Here are a few that have come up over the past year (or two 😞), and even as I wrote this, more examples kept coming up, so I may come back to add more in the future.
Quick summary
Data processing efficiency really depends on context. Faster doesn’t mean memory efficient, and what may be the best in a standard setting can be notably worse in others. Some approaches won’t show their value until the data is very large, or there are many groups to deal with, while others will get notably worse. Also, you may not want an additional package dependency beyond what you’re using, and may need a base R approach. The good news is you’ll always have options!
One caveat: I’m not saying that the following timed approaches are necessarily the best/fastest, I mostly stuck to ones I’d try first. You may find even better for your situation! A great resource to keep in mind is the fastverse, which is a collection of packages with speed and efficiency in mind, and includes a couple that are demonstrated here.
Setup and orientation
Required packages.
Also, in the following benchmarks I turn off checking whether the results are equivalent (i.e. check = FALSE), because even if the resulting data is the same, the objects may be of different classes, or be of the same class but have different attributes. You are welcome to double check that you would get the same thing as I did. You might also want to look at the autoplot of the benchmark summaries, as many results have variability that isn’t captured by just looking at median/best times.
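For example, a minimal sketch of such a plot for one of the benchmark objects created in the next section (it assumes ggplot2 is installed; the default plot type may also want ggbeeswarm):

library(ggplot2) # provides the autoplot generic; bench supplies the method

autoplot(bm_fill) # shows the full distribution of iteration timings, not just the median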
Fill in missing values
We’ll start with the problem of filling in missing values by group. I’ve created a realistic example where the missingness is seen randomly across multiple columns, and differently across groups. I’ve chosen to compare tidyr, tidyfast, the underlying approach of tidyr via vctrs, and data.table.
Note that data.table is the only one that does not do last observation carried forward (‘down’ in tidyr parlance) by default, so that argument is made explicit. All of these objects will have different attributes or classes, and tidyfast for some reason renames the grouping variable to ‘by’. If you wrap all of the results in data.frame, that will strip the attributes and give them the same class, so you can verify they return the same thing.
set.seed(1234)

N  = 1e5
Ng = 5000

create_missing <- function(x) {
  x[sample(1:length(x), 5)] = NA
  x
}

df_missing = tibble(grp = rep(1:Ng, e = N / Ng)) %>%
  arrange(grp) %>%
  group_by(grp) %>%
  mutate(
    x = 1:n(),
    y = rpois(n(), 5),
    z = rnorm(n(), 5),
    across(x:z, create_missing)
  ) %>%
  ungroup()

df_missing %>% head(5)

dt_missing = as.data.table(df_missing) %>% setkey(grp)
tf_missing = copy(dt_missing)
bm_fill <-
  bench::mark(
    tidyr    = fill(group_by(df_missing, grp), x:z),
    tidyfast = dt_fill(tf_missing, x, y, z, id = grp),
    vctrs    = df_missing %>% group_by(grp) %>% mutate(across(x:z, vec_fill_missing)),
    dt       = dt_missing[
      ,
      .(x = nafill(x, type = 'locf'),
        y = nafill(y, type = 'locf'),
        z = nafill(z, type = 'locf')),
      by = grp
    ],
    check = FALSE,
    iterations = 10
  )
This is a great example of where there is a notable speed/memory trade-off. It’s very surprising how much memory data.table uses¹, while not giving much speed advantage relative to tidyr. Perhaps there is something I’m missing (😏)? Also note that we can get an even ‘tidier’ advantage by using vctrs directly rather than wrapping it via tidyr, and given how easy it is to use, it’s probably the best option.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
tidyfast | 18.3ms | 24.5ms | 39.87MB | 1 | 6.06 |
vctrs | 27ms | 28.5ms | 6.58MB | 1.16 | 1 |
dt | 111.7ms | 117.8ms | 237.08MB | 4.81 | 36.04 |
tidyr | 132.3ms | 147.2ms | 6.59MB | 6.01 | 1 |
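As a quick check of the earlier claim that wrapping the results in data.frame makes them comparable, here is a minimal sketch using the tidyr and data.table results from above (it assumes both end up sorted by grp with the same column order, which holds for the objects as created here):

res_tidyr = fill(group_by(df_missing, grp), x:z)
res_dt    = copy(dt_missing)[
  ,
  .(x = nafill(x, type = 'locf'),
    y = nafill(y, type = 'locf'),
    z = nafill(z, type = 'locf')),
  by = grp
]

# different classes (grouped tibble vs. data.table), same underlying values
all.equal(data.frame(ungroup(res_tidyr)), data.frame(res_dt))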
Antijoins
Sometimes you just don’t want that! In the following we have a situation where we want to filter values based on the negation of some condition. Think of a case where certain person IDs are not viable for consideration in an analysis. Many times, a natural approach would be a filter where, instead of using vals %in% values_desired, we just negate that with the bang (!) operator. Another approach, however, is to create a data frame of the undesired values and use an anti_join. With joins in general, you get a relative advantage by explicitly noting the variables you’re joining on, so I compare that as well for demonstration. Finally, in this particular example we could use data.table’s built-in character match, %chin%.
set.seed(123)

df1 = tibble(
  id = sample(letters, 10000, replace = TRUE)
)

df2 = tibble(
  id = sample(letters[1:10], 10000, replace = TRUE)
)

df1_lazy = lazy_dt(df1)
df2_lazy = lazy_dt(df2)

df1_dt = data.table(df1)
df2_dt = data.table(df2)

suppressMessages({
  bm_antijoin = bench::mark(
    in_     = filter(df1, !id %in% letters[1:10]),
    in_dtp  = collect(filter(df1_lazy, !id %in% letters[1:10])), # not usable until collected/as_tibble'd
    chin    = filter(df1, !id %chin% letters[1:10]),             # %chin% from data.table, character vectors only
    chin_dt = df1_dt[!df1_dt$id %chin% letters[1:10], ],
    coll    = fsubset(df1, id %!in% letters[1:10]),              # collapse; can work with dt or tidyverse
    aj      = anti_join(df1, df2, by = 'id'),
    aj_noby = anti_join(df1, df2),
    iterations = 100,
    check = FALSE
  )
})
In this case, the fully data.table approach is best in speed and memory, but collapse is not far behind². In addition, if you are in the tidyverse, the anti_join function is a very good option. Hopefully the lesson about explicitly setting the by argument is clear.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
chin_dt | 152µs | 160µs | 206KB | 1 | 1 |
coll | 167µs | 185µs | 268KB | 1.16 | 1.3 |
aj | 529µs | 559µs | 293KB | 3.5 | 1.42 |
chin | 626µs | 662µs | 270KB | 4.15 | 1.31 |
in_ | 716µs | 804µs | 387KB | 5.04 | 1.88 |
in_dtp | 923µs | 988µs | 479KB | 6.2 | 2.33 |
aj_noby | 9.95ms | 10.962ms | 323KB | 68.72 | 1.57 |
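As an aside that isn’t timed above, data.table also has a native anti-join form, where a negated join in i does the subsetting directly; a minimal sketch with the objects from above:

# rows of df1_dt whose id has no match in df2_dt
df1_dt[!df2_dt, on = 'id']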
Lag/lead/differences
Here we are interested in getting the difference between the current value of some feature and its last (or next) value, typically called a lag (lead). Note that it doesn’t have to be the immediately previous value, but that is most common. In the tidyverse we have the lag/lead functions, while data.table has the generic shift function that can do both. In the following I look at using that function in the fully data.table setting and within a tibble.
set.seed(1234)

N  = 1e5
Ng = 5000

df = tibble(
  x   = rpois(N, 10),
  grp = rep(1:Ng, e = N / Ng)
)

dt = as.data.table(df)

bm_lag = bench::mark(
  dplyr_lag  = mutate(df, x_diff = x - lag(x)),
  dplyr_lead = mutate(df, x_diff = x - lead(x)),
  dt_lag     = dt[, x_diff := x - shift(x)],
  dt_lead    = dt[, x_diff := x - shift(x, n = -1)],
  dt_dp_lag  = mutate(df, x_diff = x - shift(x)),
  dt_dp_lead = mutate(df, x_diff = x - shift(x, n = -1)),
  coll_lag   = ftransform(df, x_diff = x - flag(x)),
  coll_lead  = ftransform(df, x_diff = x - flag(x, n = -1)),
  iterations = 100,
  check = FALSE
)
In this case, collapse is best, with data.table not far behind, but using the shift function within the tidy approach is a very solid gain. Oddly, lag and lead seem somewhat different in terms of speed and memory.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
coll_lag | 226µs | 260µs | 783.84KB | 1 | 1 |
coll_lead | 250µs | 309µs | 783.84KB | 1.18 | 1 |
dt_lag | 316µs | 354µs | 813.87KB | 1.36 | 1.04 |
dt_lead | 319µs | 355µs | 813.87KB | 1.36 | 1.04 |
dt_dp_lag | 792µs | 818µs | 782.91KB | 3.14 | 1 |
dt_dp_lead | 797µs | 892µs | 782.91KB | 3.42 | 1 |
dplyr_lead | 994µs | 1.115ms | 1.53MB | 4.28 | 2 |
dplyr_lag | 1.044ms | 1.2ms | 1.15MB | 4.61 | 1.5 |
What about a grouped scenario? To keep it simple we’ll just look at using lagged values.
bm_lag_grp = bench::mark(
  dt_lag    = dt[, x_diff := x - shift(x), by = grp],
  dt_dp_lag = mutate(group_by(df, grp), x_diff = x - shift(x)),
  dplyr_lag = mutate(group_by(df, grp), x_diff = x - lag(x)),
  coll_lag  = fmutate(fgroup_by(df, grp), x_diff = x - flag(x)),
  iterations = 10,
  check = FALSE
)
In the grouped situation, collapse isn’t best for memory, but the speed gain is ridiculous!!
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
coll_lag | 484µs | 548µs | 1.59MB | 1 | 3.52 |
dt_lag | 25.692ms | 28.067ms | 462.32KB | 51.2 | 1 |
dt_dp_lag | 30.843ms | 32.969ms | 2.61MB | 60.15 | 5.77 |
dplyr_lag | 77.156ms | 78.771ms | 5.19MB | 143.7 | 11.49 |
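As a side note (not benchmarked above), collapse’s flag can also take the grouping vector directly via its g argument, skipping the grouped-data-frame step entirely; a minimal sketch, assuming the data are already ordered within groups as they are here:

# grouped lagged difference without an explicit group_by/fgroup_by step
df_coll = ftransform(df, x_diff = x - flag(x, g = grp))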
First/Last
In this demo, we want to take the first (last) value of each group. Surprisingly, for the same functionality, it turns out that the number of groups matters when doing groupwise operations. For the following I’ll even use a base R approach (though within dplyr’s summarize) to demonstrate some differences.
set.seed(1234)

N = 100000

df = tibble(
  x  = rpois(N, 10),
  id = sample(1:100, N, replace = TRUE)
)

dt = as.data.table(df)

bm_first = bench::mark(
  base_first  = summarize(group_by(df, id), x = x[1]),
  base_last   = summarize(group_by(df, id), x = x[length(x)]),
  dplyr_first = summarize(group_by(df, id), x = dplyr::first(x)),
  dplyr_last  = summarize(group_by(df, id), x = dplyr::last(x)),
  dt_first    = dt[, .(x = data.table::first(x)), by = id],
  dt_last     = dt[, .(x = data.table::last(x)), by = id],
  coll_first  = ffirst(fgroup_by(df, id)),
  coll_last   = flast(fgroup_by(df, id)),
  iterations = 100,
  check = FALSE
)
The first result is actually not too surprising, in that the fully dt approaches are fast and memory efficient, though collapse is notably faster. Somewhat interesting is that the base last is a bit faster than dplyr’s last (technically nth) approach.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
coll_last | 307µs | 341µs | 783.53KB | 1 | 1.72 |
coll_first | 299µs | 345µs | 783.53KB | 1.01 | 1.72 |
dt_last | 698µs | 876µs | 454.62KB | 2.57 | 1 |
dt_first | 689µs | 895µs | 454.62KB | 2.62 | 1 |
base_last | 1.944ms | 2.036ms | 2.06MB | 5.97 | 4.64 |
dplyr_first | 2.026ms | 2.127ms | 2.06MB | 6.24 | 4.64 |
base_first | 1.984ms | 2.165ms | 2.06MB | 6.35 | 4.64 |
dplyr_last | 2.111ms | 2.553ms | 2.06MB | 7.49 | 4.64 |
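As a quick check of the note above that dplyr’s first/last are essentially nth under the hood, here is a toy sketch (x_demo is just a made-up vector for illustration):

x_demo = c(3, 1, 4, 1, 5)

identical(dplyr::first(x_demo), dplyr::nth(x_demo, 1))   # TRUE
identical(dplyr::last(x_demo),  dplyr::nth(x_demo, -1))  # TRUE; negative n counts from the end
identical(dplyr::last(x_demo),  x_demo[length(x_demo)])  # TRUE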
In the following, the only thing that changes is the number of groups.
set.seed(1234)

N = 100000

df = tibble(
  x  = rpois(N, 10),
  id = sample(1:(N/10), N, replace = TRUE) # <--- change is here
)

dt = as.data.table(df)

bm_first_more_groups = bench::mark(
  base_first  = summarize(group_by(df, id), x = x[1]),
  base_last   = summarize(group_by(df, id), x = x[length(x)]),
  dplyr_first = summarize(group_by(df, id), x = dplyr::first(x)),
  dplyr_last  = summarize(group_by(df, id), x = dplyr::last(x)),
  dt_first    = dt[, .(x = data.table::first(x)), by = id],
  dt_last     = dt[, .(x = data.table::last(x)), by = id],
  coll_first  = ffirst(group_by(df, id)),
  coll_last   = flast(group_by(df, id)),
  iterations = 100,
  check = FALSE
)
Now what the heck is going on here? The base R approach is way faster than even data.table, while not using any more memory than the dplyr approach (both go through group_by/summarize). More to the point, collapse is notably faster than the other options, but still a bit heavy memory-wise relative to data.table.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
coll_last | 2.813ms | 3.006ms | 2.79MB | 1 | 3.8 |
coll_first | 2.838ms | 3.128ms | 2.79MB | 1.04 | 3.8 |
base_first | 10.596ms | 11.4ms | 3.06MB | 3.79 | 4.17 |
base_last | 12.551ms | 13.151ms | 3.06MB | 4.37 | 4.17 |
dt_last | 17.588ms | 18.318ms | 752.15KB | 6.09 | 1 |
dt_first | 17.671ms | 18.379ms | 752.15KB | 6.11 | 1 |
dplyr_first | 20.066ms | 20.943ms | 3.06MB | 6.97 | 4.17 |
dplyr_last | 20.916ms | 21.357ms | 3.16MB | 7.1 | 4.31 |
Coalesce/ifelse
Very often we want to change values to a single constant based on some condition, often starting with ifelse. This is similar to our previous fill situation for missing values, but applies a constant as opposed to the last/next value. coalesce is similar to tidyr’s fill, and is often used in cases where we might otherwise use an ifelse-style approach. In the following, we want to change NA values to zero, and there are many ways we might go about it.
set.seed(1234)

x = rnorm(1000)
x[x > 2] = NA

bm_coalesce = bench::mark(
  base      = {x[is.na(x)] <- 0; x},
  ifelse    = ifelse(is.na(x), 0, x),
  if_else   = if_else(is.na(x), 0, x),
  vctrs     = vec_assign(x, is.na(x), 0),
  tidyr     = replace_na(x, 0),
  fifelse   = fifelse(is.na(x), 0, x),
  coalesce  = coalesce(x, 0),
  fcoalesce = fcoalesce(x, 0),
  nafill    = nafill(x, fill = 0),
  coll      = replace_NA(x) # default is 0
)
The key result here to me is just how much memory the dplyr if_else approach uses, as well as how fast and memory efficient the base R approach is even with its two steps. While providing type safety, if_else is both slow and a memory hog, so probably anything else is better. tidyr’s replace_na would be a good option here, and while it avoids the memory issue, it’s relatively slow compared to other approaches, including the function it wraps (vec_assign), which is also demonstrated. Interestingly, fcoalesce and fifelse would both be better options than nafill, the data.table function that exists explicitly for this task.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
fcoalesce | 1µs | 1µs | 7.86KB | 1 | 1 |
coll | 1µs | 1µs | 7.86KB | 1.03 | 1 |
base | 2µs | 3µs | 7.91KB | 2.17 | 1.01 |
fifelse | 3µs | 4µs | 11.81KB | 2.6 | 1.5 |
vctrs | 4µs | 4µs | 11.81KB | 3.09 | 1.5 |
nafill | 5µs | 6µs | 23.95KB | 4.14 | 3.05 |
tidyr | 9µs | 10µs | 11.81KB | 7.09 | 1.5 |
coalesce | 15µs | 17µs | 20.19KB | 11.71 | 2.57 |
ifelse | 15µs | 17µs | 47.31KB | 12.11 | 6.02 |
if_else | 21µs | 25µs | 71.06KB | 17.54 | 9.04 |
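As an aside on the type safety mentioned above, here is a small illustration of the check that if_else performs and base ifelse does not (the exact error message will vary by dplyr version):

# base ifelse silently coerces to a common type
ifelse(c(TRUE, FALSE), 1, "a")    # returns c("1", "a")

# dplyr's if_else requires `true` and `false` to be compatible types, so this errors
# if_else(c(TRUE, FALSE), 1, "a")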
Conditional Slide
I recently had a problem where I wanted to apply a function that required taking the difference between the current and last value, as in the lag demo. The catch was that ‘last’ depended on a specific condition being met. The basic idea is that we want to take x - lag(x), but where the condition is FALSE we need to ignore that value as a candidate for the ‘last’ value, and only use the previous value for which the condition is TRUE. In the following data, the first two values both meet the condition, so the difference is straightforward (6 minus 10). But for the fourth row, 4 should subtract 6 rather than 5, because the condition on row 3 is FALSE.
set.seed(1234)

df = tibble(
  x     = sample(1:10),
  cond  = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  group = rep(c('a', 'b'), e = 5)
)

df
# A tibble: 10 × 3
x cond group
<int> <lgl> <chr>
1 10 TRUE a
2 6 TRUE a
3 5 FALSE a
4 4 TRUE a
5 1 FALSE a
6 8 TRUE b
7 2 FALSE b
8 7 FALSE b
9 9 TRUE b
10 3 FALSE b
While somewhat simple in concept, this doesn’t really work with simple lags, as the answer would be wrong, or with sliding functions, because the window is adaptive. I wrote the following function to deal with it. By default, it takes the vector under consideration, x, makes it NA where the condition doesn’t hold, then fills in the NA values with the last valid value using vec_fill_missing (or with a supplied constant/single value). However, there is flexibility beyond that type of fill. In addition, the function applied is generic: it could be applied to the newly created variable (.x), or use both the original (x) and the newly created variable.
conditional_slide <-
  function(x,
           condition,
           fun,
           direction  = c("down"),
           fill_value = NA,
           na_value   = NA,
           ...) {

    if (!direction %in% c("constant", "down", "up", "downup", "updown"))
      rlang::abort('direction must be one of "constant", "down", "up", "downup", "updown"')

    if (length(x) != length(condition))
      rlang::abort('condition and x must be the same length')

    # can't use dplyr/dt ifelse since we won't know class type of fill_value
    conditional_val <- ifelse(direction == 'constant', fill_value, NA)
    .x <- ifelse(condition, x, conditional_val)

    if (direction != 'constant')
      .x <- vctrs::vec_fill_missing(.x, direction = direction)

    class(.x) <- class(x)

    result <- fun(x, .x, ...)

    if (!is.na(na_value))
      result[is.na(result)] <- na_value

    result
  }
The first example applies the function x - lag(x) to our dataset, which in my case I also wanted to apply within groups, something that caused further problems for some of the functions I thought would otherwise be applicable. I also show it for other types of problems: taking a cumulative sum that ignores values where the condition is FALSE (via a constant zero fill), and conditionally carrying the last value forward.
df %>%
  group_by(group) %>%
  mutate(
    # demo first difference
    simple_diff = x - dplyr::lag(x),
    cond_diff   = conditional_slide(x, cond, fun = \(x, .x) x - lag(.x), na_value = 0),
    # demo cumulative sum
    simple_cumsum = cumsum(x),
    cond_cumsum   = conditional_slide(
      x,
      cond,
      fun = \(x, .x) cumsum(.x),
      direction = 'constant',
      fill = 0
    ),
    # demo fill last
    simple_fill_last = vec_fill_missing(x),
    cond_fill_last   = conditional_slide(
      x,
      cond,
      fun = \(x, .x) .x,
      direction = 'down'
    )
  )
# A tibble: 10 × 9
# Groups: group [2]
x cond group simple_diff cond_diff simple_cumsum cond_cumsum
<int> <lgl> <chr> <int> <dbl> <int> <int>
1 10 TRUE a NA 0 10 10
2 6 TRUE a -4 -4 16 16
3 5 FALSE a -1 -1 21 16
4 4 TRUE a -1 -2 25 20
5 1 FALSE a -3 -3 26 20
6 8 TRUE b NA 0 8 8
7 2 FALSE b -6 -6 10 8
8 7 FALSE b 5 -1 17 8
9 9 TRUE b 2 1 26 17
10 3 FALSE b -6 -6 29 17
# ℹ 2 more variables: simple_fill_last <int>, cond_fill_last <int>
This is one of those things that comes up from time to time where trying to apply a standard tool likely won’t cut it. You may find similar situations where you need to modify what’s available and create some functionality tailored to your needs.
Take the first TRUE
Sometimes we want the first instance of a condition being met. For example, we might want the position or value of the first number greater than some value. We’ve already investigated using dplyr’s and data.table’s first, and I won’t do so again here except to say they are both notably slower and worse on memory here. We have a few approaches we might take in base R. Using which would be common, but there is also which.max, which, when applied to a logical vector, gives the position of the first TRUE (which.min gives the position of the first FALSE). In addition, there is the Position function, which I didn’t even know about until messing with this problem.
set.seed(123)

x = sort(rnorm(10000))

marker = 2

bm_first_true_1 = bench::mark(
  which     = which(x > marker)[1],
  which_max = which.max(x > marker),
  pos       = Position(\(x) x > marker, x)
)

# make it slightly more challenging
x = sort(rnorm(1e6))

marker = 4

bm_first_true_2 = bench::mark(
  which     = which(x > marker)[1],
  which_max = which.max(x > marker),
  pos       = Position(\(x) x > marker, x)
)
Interestingly, Position provides the best memory performance, but is prohibitively slow. which.max is probably your best bet.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
which | 15µs | 20µs | 79.14KB | 1 | 14.19 |
which_max | 18µs | 20µs | 39.11KB | 1.01 | 7.01 |
pos | 1.986ms | 2.158ms | 5.58KB | 109.67 | 1 |
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
which | 1.44ms | 1.69ms | 7.63MB | 1 | 1400.58 |
which_max | 1.79ms | 2ms | 3.81MB | 1.18 | 700.29 |
pos | 213.55ms | 219.13ms | 5.58KB | 129.4 | 1 |
But not so fast! The following makes the first TRUE occur very early in the vector, and here Position blows the other options out of the water. I guess if you knew this was going to be the case, you could take serious advantage.
set.seed(123)

x = sort(rnorm(100000), decreasing = TRUE)

x[1:30] = 4

bm_first_true_3 = bench::mark(
  which     = which(x < marker)[1],
  which_max = which.max(x < marker),
  pos       = Position(\(x) x < marker, x)
)
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
pos_rev | 10µs | 10µs | 5.58KB | 1 | 1 |
which_max_rev | 110µs | 130µs | 390.67KB | 13.04 | 70.04 |
which_rev | 160µs | 220µs | 1.14MB | 22.55 | 210.09 |
Group by filtering/slicing
The previous situation was the basis for this next demo, where we utilize which.max. We want to filter on two criteria: if all of a group’s values are zero, we drop that group entirely; otherwise, we retain a group’s values only from the point where the first non-zero value occurs, keeping everything from that point on (even subsequent zeros). To make things clearer: for the example data that follows, we want to drop group 1 entirely, drop the initial part of group 2, and retain all of group 3.
library(tidyverse)

set.seed(12345)

df = tibble(
  group = rep(1:3, e = 10),
  value = c(
    rep(0, 10),
    c(rep(0, 3), sample(0:5, 7, replace = TRUE)),
    sample(0:10, 10, replace = TRUE)
  )
)

df %>%
  group_by(group) %>%
  filter(!all(value == 0)) %>%
  slice(which.max(value > 0):n())
# A tibble: 17 × 2
# Groups: group [2]
group value
<int> <dbl>
1 2 5
2 2 2
3 2 1
4 2 3
5 2 1
6 2 4
7 2 2
8 3 7
9 3 1
10 3 5
11 3 10
12 3 5
13 3 6
14 3 9
15 3 0
16 3 7
17 3 6
In the above scenario, we take two steps to illustrate our desired outcome conceptually. Ideally though, we’d like one step, because it is ultimately just a general filtering operation. You might think to change which.max to which and just slice, but that would remove all zeros, when we want to retain zeros after the point where at least some values are greater than zero. Using row_number was one way I thought of to get around this.
df %>%
  group_by(group) %>%
  filter(!all(value == 0) & row_number() >= which.max(value > 0))

bm_filter_slice = bench::mark(
  orig = df %>%
    group_by(group) %>%
    filter(!all(value == 0)) %>%
    slice(which.max(value > 0):n()),
  new = df %>%
    group_by(group) %>%
    filter(!all(value == 0) & row_number() >= which.max(value > 0))
)
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
orig | 2ms | 2.2ms | 10.25KB | 1 | 1.11 |
new | 6.1ms | 7.1ms | 9.21KB | 3.28 | 1 |
Well, we got it down to one operation, but now it takes longer and has no memory advantage. Are we on the wrong track? Let’s try a realistically sized data set with a lot of groups.
set.seed(1234)

N = 100000
g = 1:(N/4)

df = tibble(
  group = rep(g, e = 4),
  value = sample(0:5, size = N, replace = TRUE)
)

bm_filter_slice2 = bench::mark(
  orig = df %>%
    group_by(group) %>%
    filter(!all(value == 0)) %>%
    slice(which.max(value > 0):n()),
  new = df %>%
    group_by(group) %>%
    filter(!all(value == 0) & row_number() >= which.max(value > 0))
)
Now we have the reverse scenario. The single filter is notably faster and more memory efficient.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
new | 208.6ms | 209.9ms | 12.9MB | 1 | 1 |
orig | 949.8ms | 949.8ms | 28.3MB | 4.53 | 2.19 |
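For comparison (not benchmarked here), a rough data.table version of the same one-step filter might look like the following sketch, assuming dt = as.data.table(df):

dt = as.data.table(df)

dt[, .SD[!all(value == 0) & seq_len(.N) >= which.max(value > 0)], by = group]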
Tidy timings
Overview
This tidy timings section comes from a notably old exploration I rediscovered (I think it was originally from January 2020), but tidyfast still has some functionality beyond dtplyr, and it doesn’t hurt to revisit. I added a result for collapse. My original timings were on a nicely souped-up PC, but the following are on a year-and-a-half-old MacBook with an M1 processor, and were almost 2x faster.
Here I take a look at some timings for data processing tasks. My reason for doing so is that dtplyr has recently arisen from the dead, and tidyfast has come on the scene, so I wanted a quick reference for myself and others to see how things stack up against data.table.
So we have the following:
- Base R: Just kidding. If you’re using base R approaches for this you will always be slower. Functions like aggregate, tapply, and similar could be used in these demos, but I leave that as an exercise to the reader. I’ve done them, and it isn’t pretty.
- dplyr: standard data wrangling workhorse package
- tidyr: has some specific functionality not included in dplyr
- data.table: another commonly used data processing package that purports to be faster and more memory efficient (usually, but not always)
- tidyfast: can only do a few things, but does them quickly.
- collapse: many replacements for base R functions.
Standard grouped operation
The following demonstrates some timings from this post on stackoverflow. I reproduced it on my own machine based on 50 million observations. The grouped operations that are applied are just a sum and length on a vector. As this takes several seconds to do even once, I only do it one time.
set.seed(123)

n   = 5e7
k   = 5e5
x   = runif(n)
grp = sample(k, n, TRUE)

timing_group_by_big = list()

# dplyr
timing_group_by_big[["dplyr"]] = system.time({
  df = tibble(x, grp)
  r.dplyr = summarise(group_by(df, grp), sum(x), n())
})

# dtplyr
timing_group_by_big[["dtplyr"]] = system.time({
  df = lazy_dt(tibble(x, grp))
  r.dtplyr = df %>% group_by(grp) %>% summarise(sum(x), n()) %>% collect()
})

# tidyfast
timing_group_by_big[["tidyfast"]] = system.time({
  dt = setnames(setDT(list(x, grp)), c("x", "grp"))
  r.tidyfast = dt_count(dt, grp)
})

# data.table
timing_group_by_big[["data.table"]] = system.time({
  dt = setnames(setDT(list(x, grp)), c("x", "grp"))
  r.data.table = dt[, .(sum(x), .N), grp]
})

# collapse
timing_group_by_big[["collapse"]] = system.time({
  df = tibble(x, grp)
  r.collapse = fsummarise(fgroup_by(df, grp), x = fsum(x), n = fnobs(x))
})

timing_group_by_big = timing_group_by_big %>%
  do.call(rbind, .) %>%
  data.frame() %>%
  rownames_to_column('package')
package | elapsed |
---|---|
dplyr | 7.03 |
dtplyr | 1.30 |
collapse | 1.02 |
data.table | 0.67 |
tidyfast | 0.47 |
We can see that all options are notable improvements on dplyr. tidyfast is a little optimistic, as it can count but does not appear to do a summary operation like means or sums.
Count
To make things more evenly matched, we’ll just do a simple grouped count. In the following, I add a different option for dplyr if all we want are group sizes. In addition, you have to ‘collect’ the data for a dtplyr object, otherwise the resulting object is not actually a usable tibble, and we don’t want the timing to stop before the operation is actually performed. You can do this with the collect function or as_tibble.
data(flights, package = 'nycflights13')

head(flights)

flights_dtp = lazy_dt(flights)

flights_dt = data.table(flights)

bm_count_flights = bench::mark(
  dplyr_base = count(flights, arr_time),
  dtplyr     = collect(count(flights_dtp, arr_time)), # the lazy_dt object needs collecting
  tidyfast   = dt_count(flights_dt, arr_time),
  data.table = flights_dt[, .(n = .N), by = arr_time],
  iterations = 100,
  check = FALSE
)
Here are the results. It’s important to note the memory as well as the time: the faster functions here take a bit more memory to get there. If you’re dealing with very large data, that could matter more when the timings aren’t too different.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
data.table | 2ms | 2.4ms | 9.07MB | 1 | 1.5 |
tidyfast | 2ms | 3.3ms | 9.05MB | 1.38 | 1.49 |
dplyr_gs | 3.7ms | 4ms | 6.06MB | 1.65 | 1 |
dtplyr | 3.5ms | 4.2ms | 9.06MB | 1.73 | 1.5 |
dplyr_base | 9.3ms | 10.5ms | 6.12MB | 4.31 | 1.01 |
Just for giggles I did the same in Python with a pandas DataFrame, and depending on how you go about it, you could be notably slower than all of these methods, or take less than half the time of the standard dplyr approach. Unfortunately I can’t reproduce it here³, but I did run it on the same machine using a df.groupby().size() approach to create the same type of data frame. Things get worse as you move to something not as simple, like summarizing with a custom function, even if that custom function is still simple arithmetic.
A lot of folks that use Python primarily still think R is slow, but that is mostly just a sign that they don’t know how to effectively program with R for data science. I know folks who use Python more, but also use tidyverse, and I use R more but also use pandas quite a bit. It’s not really a debate - tidyverse is easier, less verbose, and generally faster relative to pandas, especially for more complicated operations. If you start using tools like data.table, then there is really no comparison for speed and efficiency. You can run the following for comparison.
import pandas as pd

# flights = r.flights
flights = pd.read_parquet('data/flights.parquet')

flights.groupby("arr_time", as_index=False).size()
arr_time size
0 1.0 201
1 2.0 164
2 3.0 174
3 4.0 173
4 5.0 206
... ... ...
1406 2356.0 202
1407 2357.0 207
1408 2358.0 189
1409 2359.0 222
1410 2400.0 150
[1411 rows x 2 columns]
def test():
    flights.groupby("arr_time", as_index=False).arr_time.count()

test()

import timeit

# see documentation
timeit.timeit()
0.004073165997397155

test_result = timeit.timeit(stmt="test()", setup="from __main__ import test", number = 100)

# default result is in seconds for the total number of 100 runs
test_result/100*1000 # ms per run
3.1555674999253824
Summary
Programming is a challenge, and programming in a computationally efficient manner is even harder. Depending on your situation, you may need to switch tools or just write your own to come up with the best solution.
Footnotes
1. data.table modifies in place, so technically it doesn’t have anything to fill after the first run. As a comparison, I created new columns for the filled-in values, and this made almost no speed/memory difference. I also tried copy(dt_missing)[...], which had a minor speed hit, and using setkey first, which made no difference. Note also that data.table has setnafill, but this apparently has no grouping argument, so it is not demonstrated.↩︎

2. As of this writing, I’m new to the collapse package, and so might be missing other uses that might be more efficient.↩︎

3. This is because reticulate still has issues with M1 out of the box, and even then getting it to work can be a pain.↩︎
Reuse
Citation
@online{clark2022,
author = {Clark, Michael},
title = {Programming {Odds} \& {Ends}},
date = {2022-07-25},
url = {https://m-clark.github.io/posts/2022-07-25-programming/},
langid = {en}
}