Combinations to columns — combn_2

Convert a character or factor of multiple labels.

combn_2_col(
  data,
  var,
  sep = "[^[:alnum:]]+",
  max_m = 1,
  collapse = "_",
  toInteger = FALSE,
  sparse = FALSE
)

Arguments

data	The data frame in question.
var	The quoted name for the variable in question. The variable can be character or factor.
sep	The label separator, for example a comma or space.
max_m	The maximum number of possible combinations. Default is 1.
collapse	In the names of the new columns, how do you want the label combinations separated?
toInteger	Convert the logical result to integers of 0,1.
sparse	Return only the new indicators as a sparse matrix?

Value

A data frame with the new indicator columns, or a sparse matrix of only the indicator columns.

Details

This comes up every once in a while. Someone has for whatever reason coded multiple labels into cells within a single column, and now you need those individual labels for analysis. This function will create indicator columns for every combination of labels up to max_m labels. It will also return a list column, called 'combo', a version of the original, but where the entries are more usable vectors of labels, which might be useful for further processing.

Note that the number of possible combinations grows very quickly when there are many unique labels, well more than your machine can handle, so use sensible values. Check with combn(n_unique_labels, n_combinations) if you think there might be an issue.

Usually this situation is a result of poor data entry, and you'll likely need to do a little text pre-processing just to get started.

This can actually be used for one hot encoding if max_m is set to 1, though I'll make a more efficient version of that process in a later function. The combo column becomes superfluous in this case.

If you don't need combinations and each cell has the same pattern of entry, you could use tidyr::separate.

I tested this against a model.matrix approach and two text-analysis approaches (see examples), and with a problem that was notably more sizeable than the examples. Using model.matrix wasn't viable with even that size, and surprisingly, a simple tidytext approach was consistently fastest. However, this implementation is parallelizable in two parts, and requires nothing beyond what comes with a base R installation, so it wins.

Examples

library(tidyext)

d = data.frame(id = 1:4, labs = c('A/B', 'B/C/D/E', 'A/E', 'D/E'))
test = combn_2_col(data = d, var = 'labs', max_m = 3)
test
#>   id    labs
#> 1  1     A/B
#> 2  2 B/C/D/E
#> 3  3     A/E
#> 4  4     D/E
#>                                                                                              combo
#> 1                                                                                        A, B, A_B
#> 2 B, C, D, E, B_C, B_D, B_E, C_D, C_E, D_E, B_C_D, B_C_E, B_D_E, C_D_E, B_C_D, B_C_E, B_D_E, C_D_E
#> 3                                                                                        A, E, A_E
#> 4                                                                                        D, E, D_E
#>       A     B   A_B     C     D     E   B_C   B_D   B_E   C_D   C_E   D_E B_C_D
#> 1  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 2 FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> 3  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 4 FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>   B_C_E B_D_E C_D_E   A_E
#> 1 FALSE FALSE FALSE FALSE
#> 2  TRUE  TRUE  TRUE FALSE
#> 3 FALSE FALSE FALSE  TRUE
#> 4 FALSE FALSE FALSE FALSE
str(test)
#> 'data.frame':	4 obs. of  20 variables:
#>  $ id   : int  1 2 3 4
#>  $ labs : chr  "A/B" "B/C/D/E" "A/E" "D/E"
#>  $ combo:List of 4
#>   ..$ : chr  "A" "B" "A_B"
#>   ..$ : chr  "B" "C" "D" "E" ...
#>   ..$ : chr  "A" "E" "A_E"
#>   ..$ : chr  "D" "E" "D_E"
#>  $ A    : logi  TRUE FALSE TRUE FALSE
#>  $ B    : logi  TRUE TRUE FALSE FALSE
#>  $ A_B  : logi  TRUE FALSE FALSE FALSE
#>  $ C    : logi  FALSE TRUE FALSE FALSE
#>  $ D    : logi  FALSE TRUE FALSE TRUE
#>  $ E    : logi  FALSE TRUE TRUE TRUE
#>  $ B_C  : logi  FALSE TRUE FALSE FALSE
#>  $ B_D  : logi  FALSE TRUE FALSE FALSE
#>  $ B_E  : logi  FALSE TRUE FALSE FALSE
#>  $ C_D  : logi  FALSE TRUE FALSE FALSE
#>  $ C_E  : logi  FALSE TRUE FALSE FALSE
#>  $ D_E  : logi  FALSE TRUE FALSE TRUE
#>  $ B_C_D: logi  FALSE TRUE FALSE FALSE
#>  $ B_C_E: logi  FALSE TRUE FALSE FALSE
#>  $ B_D_E: logi  FALSE TRUE FALSE FALSE
#>  $ C_D_E: logi  FALSE TRUE FALSE FALSE
#>  $ A_E  : logi  FALSE FALSE TRUE FALSE

d$labs =  c('A B', 'B C D E', 'A E', 'D E')
combn_2_col(data = d, var = 'labs', max_m = 1)
#>   id    labs                                          combo     A     B     C
#> 1  1     A B                                     A, B, A, B  TRUE  TRUE FALSE
#> 2  2 B C D E B, C, D, E, B, C, D, E, B, C, D, E, B, C, D, E FALSE  TRUE  TRUE
#> 3  3     A E                                     A, E, A, E  TRUE FALSE FALSE
#> 4  4     D E                                     D, E, D, E FALSE FALSE FALSE
#>       D     E
#> 1 FALSE FALSE
#> 2  TRUE  TRUE
#> 3 FALSE  TRUE
#> 4  TRUE  TRUE

d$labs =  c('Tom, Dick & Harriet', "J'Sean", "OBG, Andreas", NA)

combn_2_col(
data = d,
var = 'labs',
sep = ',',
max_m = 2,
collapse = '-'
)
#>   id                labs                                     combo   Tom
#> 1  1 Tom, Dick & Harriet Tom,  Dick & Harriet, Tom- Dick & Harriet  TRUE
#> 2  2              J'Sean                                    J'Sean FALSE
#> 3  3        OBG, Andreas               OBG,  Andreas, OBG- Andreas FALSE
#> 4  4                <NA>                                        NA FALSE
#>    Dick & Harriet Tom- Dick & Harriet J'Sean   OBG  Andreas OBG- Andreas    NA
#> 1            TRUE                TRUE  FALSE FALSE    FALSE        FALSE FALSE
#> 2           FALSE               FALSE   TRUE FALSE    FALSE        FALSE FALSE
#> 3           FALSE               FALSE  FALSE  TRUE     TRUE         TRUE FALSE
#> 4           FALSE               FALSE  FALSE FALSE    FALSE        FALSE  TRUE