Explorations in faster data processing and other problems.
Oftentimes I’m looking to gain speed/memory advantages, or maybe just exploring how to do the same thing differently in case it becomes useful later on. I keep thinking I’ll start posting these explorations, but then don’t get around to it. Here are a few that have come up over the past year (or two 😞), and even as I wrote this, more examples kept coming up, so I may come back to add more in the future.
Data processing efficiency really depends on context. Faster doesn’t mean memory efficient, and what may be the best in a standard setting can be notably worse in others. Some approaches won’t show their value until the data is very large, or there are many groups to deal with, while others will get notably worse. Also, you may not want an additional package dependency beyond what you’re using, and may need a base R approach. The good news is you’ll always have options!
One caveat: I’m not saying that the following timed approaches are necessarily the best/fastest. I mostly stuck to ones I’d try first, and you may find even better options for your situation! A great resource to keep in mind is the fastverse, a collection of packages developed with speed and efficiency in mind, which includes a couple of the packages demonstrated here.
Required packages.
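The setup chunk itself isn’t shown here, but based on what’s used throughout, it amounts to something like the following (a sketch; load whichever you plan to use):

library(tidyverse)    # dplyr, tidyr, tibble, etc.
library(data.table)
library(dtplyr)       # lazy_dt
library(tidyfast)     # dt_fill, dt_count
library(collapse)     # fsubset, flag, ffirst, fsummarise, ...
library(vctrs)        # vec_fill_missing, vec_assign
library(bench)        # bench::mark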
Also, in the following benchmarks I turn off checking whether the results are equivalent (i.e. check = FALSE), because even if the resulting data is the same, the objects may be of different classes, or some objects may even be of the same class but have different attributes. You are welcome to double check that you would get the same thing as I did. Also, you might want to look at the autoplot of the benchmark summaries, as many results have some variability that isn’t captured by just looking at median/best times.
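For example, once one of the benchmark objects below has been created (bm_fill is the object from the first benchmark), a single line shows the full distribution of timings:

ggplot2::autoplot(bm_fill)   # shows each iteration's timing, not just the summary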
We’ll start with the problem of filling in missing values by group. I’ve created a realistic example where the missingness is seen randomly across multiple columns, and differently across groups. I’ve chosen to compare tidyr, tidyfast, the underlying approach of tidyr via vctrs, and data.table.
Note that data.table is the only one that does not do last observation carried forward by default (‘down’ in tidyr parlance), so that argument is made explicit. All of these objects will have different attributes or classes, and tidyfast for some reason renames the grouping variable to ‘by’. If you wrap all of these in data.frame, that will remove the attributes and give them all the same class, so you can verify they return the same result.
set.seed(1234)

N  = 1e5
Ng = 5000

create_missing <- function(x) {
  x[sample(1:length(x), 5)] = NA
  x
}

df_missing = tibble(grp = rep(1:Ng, e = N / Ng)) %>%
  arrange(grp) %>%
  group_by(grp) %>%
  mutate(
    x = 1:n(),
    y = rpois(n(), 5),
    z = rnorm(n(), 5),
    across(x:z, create_missing)
  ) %>%
  ungroup()

df_missing %>% head(5)

dt_missing = as.data.table(df_missing) %>% setkey(grp)
tf_missing = copy(dt_missing)

bm_fill <- bench::mark(
  tidyr    = fill(group_by(df_missing, grp), x:z),
  tidyfast = dt_fill(tf_missing, x, y, z, id = grp),
  vctrs    = df_missing %>% group_by(grp) %>% mutate(across(x:z, vec_fill_missing)),
  dt       = dt_missing[
    ,
    .(x = nafill(x, type = 'locf'),
      y = nafill(y, type = 'locf'),
      z = nafill(z, type = 'locf')),
    by = grp
  ],
  check      = FALSE,
  iterations = 10
)
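As an example of the verification mentioned earlier, something along these lines (res_tidyr and res_vctrs are just names used here for illustration) confirms that two of the approaches agree once classes and attributes are stripped:

res_tidyr = fill(group_by(df_missing, grp), x:z)
res_vctrs = df_missing %>% group_by(grp) %>% mutate(across(x:z, vec_fill_missing))

all.equal(data.frame(res_tidyr), data.frame(res_vctrs))  # should be TRUE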
This is a great example of where there is a notable speed/memory trade-off. It’s very surprising how much memory data.table uses1, while not giving much speed advantage relative to tidyr. Perhaps there is something I’m missing (😏)? Also note that we can get an even ‘tidier’ advantage by using vctrs directly, rather than wrapping it via tidyr, and seeing how easy it is to use, it’s probably the best option.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
tidyfast | 18.3ms | 24.5ms | 39.87MB | 1 | 6.06 |
vctrs | 27ms | 28.5ms | 6.58MB | 1.16 | 1 |
dt | 111.7ms | 117.8ms | 237.08MB | 4.81 | 36.04 |
tidyr | 132.3ms | 147.2ms | 6.59MB | 6.01 | 1 |
Sometimes you just don’t want certain values! In the following we have a situation where we want to filter values based on the negation of some condition. Think of a case where certain person IDs are not viable for consideration for analysis. Many times, a natural approach would be to use a filter where, instead of using vals %in% values_desired, we just negate that with a bang (!) operator. However, another approach is to create a data frame of the undesired values and use an anti_join. When using joins in general, you get a relative advantage by explicitly noting the variables you’re joining on, so I compare that as well for demonstration. Finally, in this particular example we could use data.table’s built-in character match, %chin%.
set.seed(123)
df1 = tibble(
id = sample(letters, 10000, replace = TRUE)
)
df2 = tibble(
id = sample(letters[1:10], 10000, replace = TRUE)
)
df1_lazy = lazy_dt(df1)
df2_lazy = lazy_dt(df2)
df1_dt = data.table(df1)
df2_dt = data.table(df2)
suppressMessages({
bm_antijoin = bench::mark(
in_ = filter(df1, !id %in% letters[1:10]),
in_dtp = collect(filter(df1_lazy, !id %in% letters[1:10])), # not usable until collected/as_tibbled
chin = filter(df1, !id %chin% letters[1:10]), # chin for char vector only, from data.table
chin_dt = df1_dt[!df1_dt$id %chin% letters[1:10],],
coll = fsubset(df1, id %!in% letters[1:10]), # can work with dt or tidyverse
aj = anti_join(df1, df2, by = 'id'),
aj_noby = anti_join(df1, df2),
iterations = 100,
check = FALSE
)
})
In this case, the fully data.table approach is best in speed and memory, but collapse is not far behind2. In addition, if you are in the tidyverse, the anti_join function is a very good option. Hopefully the lesson about explicitly setting the by argument is made clear.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
chin_dt | 152µs | 160µs | 206KB | 1 | 1 |
coll | 167µs | 185µs | 268KB | 1.16 | 1.3 |
aj | 529µs | 559µs | 293KB | 3.5 | 1.42 |
chin | 626µs | 662µs | 270KB | 4.15 | 1.31 |
in_ | 716µs | 804µs | 387KB | 5.04 | 1.88 |
in_dtp | 923µs | 988µs | 479KB | 6.2 | 2.33 |
aj_noby | 9.95ms | 10.962ms | 323KB | 68.72 | 1.57 |
Here we are interested in getting the difference between the current value of some feature and its last (or next) value, typically called a lag (lead). Note that it doesn’t have to be the last value, but that is most common. In the tidyverse we have the lag/lead functions, while data.table has the generic shift function that can do both. In the following I look at using that function in the fully data.table situation or within a tibble.
set.seed(1234)
N = 1e5
Ng = 5000
df = tibble(
x = rpois(N, 10),
grp = rep(1:Ng, e = N / Ng)
)
dt = as.data.table(df)
bm_lag = bench::mark(
dplyr_lag = mutate(df, x_diff = x - lag(x)),
dplyr_lead = mutate(df, x_diff = x - lead(x)),
dt_lag = dt[, x_diff := x - shift(x)],
dt_lead = dt[, x_diff := x - shift(x, n = -1)],
dt_dp_lag = mutate(df, x_diff = x - shift(x)),
dt_dp_lead = mutate(df, x_diff = x - shift(x, n = -1)),
coll_lag = ftransform(df, x_diff = x - flag(x)),
coll_lead = ftransform(df, x_diff = x - flag(x, n = -1)),
iterations = 100,
check = FALSE
)
In this case, collapse is best, with data.table not far behind, but using the shift function within the tidy approach is a very solid gain. Oddly, lag and lead seem somewhat different in terms of speed and memory.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
coll_lag | 226µs | 260µs | 783.84KB | 1 | 1 |
coll_lead | 250µs | 309µs | 783.84KB | 1.18 | 1 |
dt_lag | 316µs | 354µs | 813.87KB | 1.36 | 1.04 |
dt_lead | 319µs | 355µs | 813.87KB | 1.36 | 1.04 |
dt_dp_lag | 792µs | 818µs | 782.91KB | 3.14 | 1 |
dt_dp_lead | 797µs | 892µs | 782.91KB | 3.42 | 1 |
dplyr_lead | 994µs | 1.115ms | 1.53MB | 4.28 | 2 |
dplyr_lag | 1.044ms | 1.2ms | 1.15MB | 4.61 | 1.5 |
What about a grouped scenario? To keep it simple we’ll just look at using lagged values.
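The code for the grouped comparison isn’t shown, but it would look roughly like the following sketch, where the names mirror those in the results table and the exact calls are my assumption:

bm_lag_grouped = bench::mark(
  dplyr_lag = mutate(group_by(df, grp), x_diff = x - lag(x)),
  dt_lag    = dt[, x_diff := x - shift(x), by = grp],
  dt_dp_lag = mutate(group_by(df, grp), x_diff = x - shift(x)),
  coll_lag  = ftransform(df, x_diff = x - flag(x, g = grp)),   # flag's g argument handles the grouping
  iterations = 100,
  check      = FALSE
)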
In the grouped situation, collapse isn’t best for memory, but the speed gain is ridiculous!!
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
coll_lag | 484µs | 548µs | 1.59MB | 1 | 3.52 |
dt_lag | 25.692ms | 28.067ms | 462.32KB | 51.2 | 1 |
dt_dp_lag | 30.843ms | 32.969ms | 2.61MB | 60.15 | 5.77 |
dplyr_lag | 77.156ms | 78.771ms | 5.19MB | 143.7 | 11.49 |
In this demo, we want to take the first (last) value of each group. Surprisingly, for the same functionality, it turns out that the number of groups matters when doing groupwise operations. For the following I’ll even use a base R approach (though within dplyr’s summarize) to demonstrate some differences.
set.seed(1234)
N = 100000
df = tibble(
x = rpois(N, 10),
id = sample(1:100, N, replace = TRUE)
)
dt = as.data.table(df)
bm_first = bench::mark(
base_first = summarize(group_by(df, id), x = x[1]),
base_last = summarize(group_by(df, id), x = x[length(x)]),
dplyr_first = summarize(group_by(df, id), x = dplyr::first(x)),
dplyr_last = summarize(group_by(df, id), x = dplyr::last(x)),
dt_first = dt[, .(x = data.table::first(x)), by = id],
dt_last = dt[, .(x = data.table::last(x)), by = id],
coll_first = ffirst(fgroup_by(df, id)),
coll_last = flast(fgroup_by(df, id)),
iterations = 100,
check = FALSE
)
The first result is actually not too surprising, in that the fully dt approaches are fast and memory efficient, though collapse is notably faster. Somewhat interesting is that the base last is a bit faster than dplyr’s last (technically nth) approach.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
coll_last | 307µs | 341µs | 783.53KB | 1 | 1.72 |
coll_first | 299µs | 345µs | 783.53KB | 1.01 | 1.72 |
dt_last | 698µs | 876µs | 454.62KB | 2.57 | 1 |
dt_first | 689µs | 895µs | 454.62KB | 2.62 | 1 |
base_last | 1.944ms | 2.036ms | 2.06MB | 5.97 | 4.64 |
dplyr_first | 2.026ms | 2.127ms | 2.06MB | 6.24 | 4.64 |
base_first | 1.984ms | 2.165ms | 2.06MB | 6.35 | 4.64 |
dplyr_last | 2.111ms | 2.553ms | 2.06MB | 7.49 | 4.64 |
In the following, the only thing that changes is the number of groups.
set.seed(1234)
N = 100000
df = tibble(
x = rpois(N, 10),
id = sample(1:(N/10), N, replace = TRUE) # <--- change is here
)
dt = as.data.table(df)
bm_first_more_groups = bench::mark(
base_first = summarize(group_by(df, id), x = x[1]),
base_last = summarize(group_by(df, id), x = x[length(x)]),
dplyr_first = summarize(group_by(df, id), x = dplyr::first(x)),
dplyr_last = summarize(group_by(df, id), x = dplyr::last(x)),
dt_first = dt[, .(x = data.table::first(x)), by = id],
dt_last = dt[, .(x = data.table::last(x)), by = id],
coll_first = ffirst(group_by(df, id)),
coll_last = flast(group_by(df, id)),
iterations = 100,
check = FALSE
)
Now what the heck is going on here? The base R approach is way faster than even data.table, while not using any more memory than what dplyr is doing (because of the group-by-summarize). More to the point is that collapse is notably faster than the other options, but still a bit heavy memory-wise relative to data.table.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
coll_last | 2.813ms | 3.006ms | 2.79MB | 1 | 3.8 |
coll_first | 2.838ms | 3.128ms | 2.79MB | 1.04 | 3.8 |
base_first | 10.596ms | 11.4ms | 3.06MB | 3.79 | 4.17 |
base_last | 12.551ms | 13.151ms | 3.06MB | 4.37 | 4.17 |
dt_last | 17.588ms | 18.318ms | 752.15KB | 6.09 | 1 |
dt_first | 17.671ms | 18.379ms | 752.15KB | 6.11 | 1 |
dplyr_first | 20.066ms | 20.943ms | 3.06MB | 6.97 | 4.17 |
dplyr_last | 20.916ms | 21.357ms | 3.16MB | 7.1 | 4.31 |
It’s very often that we want to change a single value based on some condition, often starting with ifelse. This is similar to our previous fill situation for missing values, but applies a constant as opposed to the last/next value. coalesce is similar to tidyr’s fill, and is often used in cases where we might otherwise use an ifelse style approach. In the following, we want to change NA values to zero, and there are many ways we might go about it.
set.seed(1234)
x = rnorm(1000)
x[x > 2] = NA
bm_coalesce = bench::mark(
base = {x[is.na(x)] <- 0; x},
ifelse = ifelse(is.na(x), 0, x),
if_else = if_else(is.na(x), 0, x),
vctrs = vec_assign(x, is.na(x), 0),
tidyr = replace_na(x, 0),
fifelse = fifelse(is.na(x), 0, x),
coalesce = coalesce(x, 0),
fcoalesce = fcoalesce(x, 0),
nafill = nafill(x, fill = 0),
coll = replace_NA(x) # default is 0
)
The key result here to me is just how much memory the dplyr if_else approach is using, as well as how fast and memory efficient the base R approach is, even with a second step. While providing type safety, if_else is both slow and a memory hog, so probably anything else is better. tidyr itself would be a good option here, and while it makes up for the memory issue, it’s relatively slow compared to other approaches, including the function it wraps (vec_assign), which is also demonstrated. Interestingly, fcoalesce and fifelse would both be better options than nafill, data.table’s approach that is explicitly for this task.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
fcoalesce | 1µs | 1µs | 7.86KB | 1 | 1 |
coll | 1µs | 1µs | 7.86KB | 1.03 | 1 |
base | 2µs | 3µs | 7.91KB | 2.17 | 1.01 |
fifelse | 3µs | 4µs | 11.81KB | 2.6 | 1.5 |
vctrs | 4µs | 4µs | 11.81KB | 3.09 | 1.5 |
nafill | 5µs | 6µs | 23.95KB | 4.14 | 3.05 |
tidyr | 9µs | 10µs | 11.81KB | 7.09 | 1.5 |
coalesce | 15µs | 17µs | 20.19KB | 11.71 | 2.57 |
ifelse | 15µs | 17µs | 47.31KB | 12.11 | 6.02 |
if_else | 21µs | 25µs | 71.06KB | 17.54 | 9.04 |
I recently had a problem where I wanted to apply a certain function that required taking the difference between the current and last value, as we did in the lag demo. The problem was that ‘last’ depended on a specific condition being met. The basic idea is that we want to take x - lag(x), but where the condition is FALSE, we need to ignore that value for consideration as the last value, and only use the previous value for which the condition is TRUE. In the following data, the first two values meet the condition, so this is straightforward (6 minus 10). But for the fourth row, 4 should subtract 6, rather than 5, because the condition is FALSE for the third row.
set.seed(1234)
df = tibble(
x = sample(1:10),
cond = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
group = rep(c('a', 'b'), e = 5)
)
df
# A tibble: 10 × 3
x cond group
<int> <lgl> <chr>
1 10 TRUE a
2 6 TRUE a
3 5 FALSE a
4 4 TRUE a
5 1 FALSE a
6 8 TRUE b
7 2 FALSE b
8 7 FALSE b
9 9 TRUE b
10 3 FALSE b
While somewhat simple in concept, it doesn’t really work with simple lags, as the answer would be wrong, or with sliding functions, because the window is adaptive. I wrote the following function to deal with this. By default, it basically takes our vector under consideration, x, makes it NA where the condition doesn’t hold, then fills in the NA values with the last value using vec_fill_missing (or a supplied constant/single value). However, there is flexibility beyond that type of fill. In addition, the function applied is generic, and could be applied to the newly created variable (.x), or use both the original (x) and the newly created variable.
conditional_slide <-
function(x,
condition,
fun,
direction = c("down"),
fill_value = NA,
na_value = NA,
...) {
if (!direction %in% c("constant", "down", "up", "downup", "updown"))
rlang::abort('direction must be one of "constant", "down", "up", "downup", "updown"')
if (length(x) != length(condition))
rlang::abort('condition and x must be the same length')
# can't use dplyr/dt ifelse since we won't know class type of fill_value
conditional_val <- ifelse(direction == 'constant', fill_value, NA)
.x <- ifelse(condition, x, conditional_val)
if (direction != 'constant')
.x <- vctrs::vec_fill_missing(.x, direction = direction)
class(.x) <- class(x)
result <- fun(x, .x, ...)
if (!is.na(na_value))
result[is.na(result)] <- na_value
result
}
The first example applies the function x - lag(x) to our dataset, which in my case I also wanted to apply within groups, and that caused further problems for some of the available functions I thought would otherwise be applicable. I also show it for a couple of other problems: a cumulative sum where values are conditionally treated as zero, and conditionally filling with the last value.
df %>%
group_by(group) %>%
mutate(
# demo first difference
simple_diff = x - dplyr::lag(x),
cond_diff = conditional_slide(x, cond, fun = \(x, .x) x - lag(.x), na_value = 0),
# demo cumulative sum
simple_cumsum = cumsum(x),
cond_cumsum = conditional_slide(
x,
cond,
fun = \(x, .x) cumsum(.x),
direction = 'constant',
fill = 0
),
# demo fill last
simple_fill_last = vec_fill_missing(x),
cond_fill_last = conditional_slide(
x,
cond,
fun = \(x, .x) .x,
direction = 'down'
)
)
# A tibble: 10 × 9
# Groups: group [2]
x cond group simple_diff cond_diff simple_cumsum cond_cumsum simple_fill_last cond_fill_last
<int> <lgl> <chr> <int> <dbl> <int> <int> <int> <int>
1 10 TRUE a NA 0 10 10 10 10
2 6 TRUE a -4 -4 16 16 6 6
3 5 FALSE a -1 -1 21 16 5 6
4 4 TRUE a -1 -2 25 20 4 4
5 1 FALSE a -3 -3 26 20 1 4
6 8 TRUE b NA 0 8 8 8 8
7 2 FALSE b -6 -6 10 8 2 8
8 7 FALSE b 5 -1 17 8 7 8
9 9 TRUE b 2 1 26 17 9 9
10 3 FALSE b -6 -6 29 17 3 9
This is one of those things that comes up from time to time where trying to apply a standard tool likely won’t cut it. You may find similar situations where you need to modify what’s available and create some functionality tailored to your needs.
Sometimes we want the first instance of a condition. For example, we might want the position or value of the first number greater than some value. We’ve already investigated using dplyr or data.table’s first, and I won’t do so again here, except to say they are both notably slower and worse on memory for this. We have a few approaches we might take in base R. Using which would be common, but there is also which.max, which, when applied to logical vectors, gives the position of the first TRUE (which.min gives the position of the first FALSE). In addition, there is the Position function, which I didn’t even know about until messing with this problem.
set.seed(123)
x = sort(rnorm(10000))
marker = 2
bm_first_true_1 = bench::mark(
which = which(x > marker)[1],
which_max = which.max(x > marker),
pos = Position(\(x) x > marker, x)
)
# make it slightly more challenging
x = sort(rnorm(1e6))
marker = 4
bm_first_true_2 = bench::mark(
which = which(x > marker)[1],
which_max = which.max(x > marker),
pos = Position(\(x) x > marker, x)
)
Interestingly, Position provides the best memory performance, but is prohibitively slow. which.max is probably your best bet.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
which | 15µs | 20µs | 79.14KB | 1 | 14.19 |
which_max | 18µs | 20µs | 39.11KB | 1.01 | 7.01 |
pos | 1.986ms | 2.158ms | 5.58KB | 109.67 | 1 |
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
which | 1.44ms | 1.69ms | 7.63MB | 1 | 1400.58 |
which_max | 1.79ms | 2ms | 3.81MB | 1.18 | 700.29 |
pos | 213.55ms | 219.13ms | 5.58KB | 129.4 | 1 |
But not so fast! The following setup makes the first qualifying case occur very early in the vector, where Position blows the other options out of the water! I guess if you knew this was going to be the case, you could take serious advantage.
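The code for this reversed setup isn’t shown; presumably it was something like reversing the sorted vector so that the first value above the cutoff sits near the front:

x_rev = rev(x)   # largest values first, so the first TRUE is hit almost immediately

bm_first_true_rev = bench::mark(
  which_rev     = which(x_rev > marker)[1],
  which_max_rev = which.max(x_rev > marker),
  pos_rev       = Position(\(x) x > marker, x_rev)
)

Position can stop as soon as it finds a hit, while which and which.max still have to evaluate the comparison over the entire vector, which is why it wins so dramatically here.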
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
pos_rev | 10µs | 10µs | 5.58KB | 1 | 1 |
which_max_rev | 110µs | 130µs | 390.67KB | 13.04 | 70.04 |
which_rev | 160µs | 220µs | 1.14MB | 22.55 | 210.09 |
The previous situation was the basis for this next demo, where we utilize which.max. Here we want to filter in two scenarios: in the first, if all values in a group are zero, we drop them; in the second, we only want to retain certain values based on a condition. In this latter case, the condition is that at least one non-zero value has occurred, in which case we want to keep all of the values from that point on (even if they are zero).
To make things more clear, for the example data that follows, we want to drop group 1 entirely, the initial part of group 2, and retain all of group 3.
library(tidyverse)
set.seed(12345)
df = tibble(
group = rep(1:3, e = 10),
value = c(
rep(0, 10),
c(rep(0, 3), sample(0:5, 7, replace = TRUE)),
sample(0:10, 10, replace = TRUE)
)
)
df %>%
group_by(group) %>%
filter(!all(value == 0)) %>%
slice(which.max(value > 0):n())
# A tibble: 17 × 2
# Groups: group [2]
group value
<int> <dbl>
1 2 5
2 2 2
3 2 1
4 2 3
5 2 1
6 2 4
7 2 2
8 3 7
9 3 1
10 3 5
11 3 10
12 3 5
13 3 6
14 3 9
15 3 0
16 3 7
17 3 6
In the above scenario, we take two steps to illustrate our desired outcome conceptually. Ideally though, we’d like one step, because it is just a general filtering. You might think to change which.max to which and just slice, but this would remove all zeros, when we want to retain zeros after the point where at least some values are greater than zero. Using row_number was a way I thought to get around this.
df %>%
group_by(group) %>%
filter(!all(value == 0) & row_number() >= which.max(value > 0))
bm_filter_slice = bench::mark(
orig = df %>%
group_by(group) %>%
filter(!all(value == 0)) %>%
slice(which.max(value > 0):n()),
new = df %>%
group_by(group) %>%
filter(!all(value == 0) & row_number() >= which.max(value > 0))
)
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
orig | 2ms | 2.2ms | 10.25KB | 1 | 1.11 |
new | 6.1ms | 7.1ms | 9.21KB | 3.28 | 1 |
Well we got it to one operation, but now it takes longer and has no memory advantage. Are we on the wrong track? Let’s try with a realistically sized data set with a lot of groups.
set.seed(1234)
N = 100000
g = 1:(N/4)
df = tibble(
group = rep(g, e = 4),
value = sample(0:5, size = N, replace = TRUE)
)
bm_filter_slice2 = bench::mark(
orig = df %>%
group_by(group) %>%
filter(!all(value == 0)) %>%
slice(which.max(value > 0):n()),
new = df %>%
group_by(group) %>%
filter(!all(value == 0) & row_number() >= which.max(value > 0))
)
Now we have the reverse scenario. The single filter is notably faster and more memory efficient.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
new | 208.6ms | 209.9ms | 12.9MB | 1 | 1 |
orig | 949.8ms | 949.8ms | 28.3MB | 4.53 | 2.19 |
This tidy timings section comes from a notably old exploration I rediscovered (I think it was originally from January 2020), but it looks like tidyfast still has some functionality beyond dtplyr, and it doesn’t hurt to revisit. I added a result for collapse. My original timings were on a nicely souped-up PC, but the following were run on a year-and-a-half-old MacBook with an M1 processor, and were almost 2x faster.
Here I take a look at some timings for data processing tasks. My reason for doing so is that dtplyr has recently arisen from the dead, and tidyfast has come on the scene, so I wanted a quick reference for myself and others to see how things stack up against data.table.
So we have dplyr, dtplyr, tidyfast, data.table, and collapse to compare. With base R functions like aggregate you will always be slower; aggregate, tapply, and similar could be used in these demos, but I leave that as an exercise to the reader. I’ve done them, and it isn’t pretty.

The following demonstrates some timings from this post on stackoverflow. I reproduced it on my own machine based on 50 million observations. The grouped operations that are applied are just a sum and length on a vector. As this takes several seconds to do even once, I only do it one time.
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)
timing_group_by_big = list()
# dplyr
timing_group_by_big[["dplyr"]] = system.time({
df = tibble(x, grp)
r.dplyr = summarise(group_by(df, grp), sum(x), n())
})
# dtplyr
timing_group_by_big[["dtplyr"]] = system.time({
df = lazy_dt(tibble(x, grp))
r.dtplyr = df %>% group_by(grp) %>% summarise(sum(x), n()) %>% collect()
})
# tidyfast
timing_group_by_big[["tidyfast"]] = system.time({
dt = setnames(setDT(list(x, grp)), c("x","grp"))
r.tidyfast = dt_count(dt, grp)
})
# data.table
timing_group_by_big[["data.table"]] = system.time({
dt = setnames(setDT(list(x, grp)), c("x","grp"))
r.data.table = dt[, .(sum(x), .N), grp]
})
# collapse
timing_group_by_big[["collapse"]] = system.time({
df = tibble(x, grp)
r.collapse = fsummarise(fgroup_by(df, grp), x = fsum(x), n = fnobs(x))
})
timing_group_by_big = timing_group_by_big %>%
do.call(rbind, .) %>%
data.frame() %>%
rownames_to_column('package')
package | elapsed |
---|---|
dplyr | 7.03 |
dtplyr | 1.30 |
collapse | 1.02 |
data.table | 0.67 |
tidyfast | 0.47 |
We can see that all options are notable improvements on dplyr. tidyfast is a little optimistic, as it can count but does not appear to do a summary operation like means or sums.
To make things more evenly matched, we’ll just do a simple grouped count. In the following, I add a different option for dplyr if all we want are group sizes. In addition, you have to ‘collect’ the data for a dtplyr object, otherwise the resulting object is not actually a usable tibble, and we don’t want to count the timing until it actually performs the operation. You can do this with the collect function or as_tibble.
data(flights, package = 'nycflights13')
head(flights)
flights_dtp = lazy_dt(flights)
flights_dt = data.table(flights)
bm_count_flights = bench::mark(
dplyr_base = count(flights, arr_time),
dtplyr = collect(count(flights_dtp, arr_time)),
tidyfast = dt_count(flights_dt, arr_time),
data.table = flights_dt[, .(n = .N), by = arr_time],
iterations = 100,
check = FALSE
)
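The results below also include a dplyr_gs entry, the group-sizes-only option mentioned above. The exact call isn’t shown, but it was presumably something along these lines, added inside the bench::mark() call:

# assumed form of the dplyr_gs (group sizes only) entry
dplyr_gs = group_size(group_by(flights, arr_time)),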
Here are the results. It’s important to note the memory as well as the time: the faster functions here are taking a bit more memory to do it. If dealing with very large data, memory could be the more important consideration when the timings aren’t too different.
expression | min | median | mem_alloc | median_relative | mem_relative |
---|---|---|---|---|---|
data.table | 2ms | 2.4ms | 9.07MB | 1 | 1.5 |
tidyfast | 2ms | 3.3ms | 9.05MB | 1.38 | 1.49 |
dplyr_gs | 3.7ms | 4ms | 6.06MB | 1.65 | 1 |
dtplyr | 3.5ms | 4.2ms | 9.06MB | 1.73 | 1.5 |
dplyr_base | 9.3ms | 10.5ms | 6.12MB | 4.31 | 1.01 |
Just for giggles I did the same in Python with a pandas DataFrame, and depending on how you go about it, you could be notably slower than all of these methods, or take less than half the time of the standard dplyr approach. Unfortunately I can’t reproduce it here3, but I did run it on the same machine using a df.groupby().size() approach to create the same type of data frame. Things get worse as you move to something not as simple, like summarizing with a custom function, even if that custom function is still simple arithmetic.
A lot of folks that use Python primarily still think R is slow, but that is mostly just a sign that they don’t know how to effectively program with R for data science. I know folks who use Python more, but also use tidyverse, and I use R more but also use pandas quite a bit. It’s not really a debate - tidyverse is easier, less verbose, and generally faster relative to pandas, especially for more complicated operations. If you start using tools like data.table, then there is really no comparison for speed and efficiency.
import pandas as pd

# flights = r.flights
flights = pd.read_parquet('~/Repos/m-clark.github.io/data/flights.parquet')

flights.groupby("arr_time", as_index=False).size()

def test():
    flights.groupby("arr_time", as_index=False).arr_time.count()

test()

import timeit
# see the documentation for timeit.timeit()
test_result = timeit.timeit(stmt="test()", setup="from __main__ import test", number = 100)

# default result is in seconds for the total number of 100 runs
test_result/100*1000  # ms per run
Programming is a challenge, and programming in a computationally efficient manner is even harder. Depending on your situation, you may need to switch tools or just write your own to come up with the best solution.
data.table modifies in place, so technically it doesn’t have anything to fill after the first run. As a comparison, I created new columns as the filled-in values, and this made almost no speed/memory difference. I also tried copy(dt_missing)[...], which had a minor speed hit. I also tried using setkey first, but that made no difference. Note also that data.table has setnafill, but this apparently has no grouping argument, so it is not demonstrated.↩︎
As of this writing, I’m new to the collapse package, and so might be missing other uses that might be more efficient.↩︎
This is because reticulate still has issues with M1 out of the box, and even then getting it to work can be a pain.↩︎