Convert a character or factor of multiple labels.

combn_2_col(
  data,
  var,
  sep = "[^[:alnum:]]+",
  max_m = 1,
  collapse = "_",
  toInteger = FALSE,
  sparse = FALSE
)

Arguments

data

The data frame in question.

var

The quoted name for the variable in question. The variable can be character or factor.

sep

The label separator, for example a comma or space.

max_m

The maximum number of possible combinations. Default is 1.

collapse

In the names of the new columns, how do you want the label combinations separated?

toInteger

Convert the logical result to integers of 0,1.

sparse

Return only the new indicators as a sparse matrix?

Value

A data frame with the new indicator columns, or a sparse matrix of only the indicator columns.

Details

This comes up every once in a while. Someone has for whatever reason coded multiple labels into cells within a single column, and now you need those individual labels for analysis. This function will create indicator columns for every combination of labels up to max_m labels. It will also return a list column, called 'combo', a version of the original, but where the entries are more usable vectors of labels, which might be useful for further processing.

Note that the number of possible combinations grows very quickly when there are many unique labels, well more than your machine can handle, so use sensible values. Check with combn(n_unique_labels, n_combinations) if you think there might be an issue.

Usually this situation is a result of poor data entry, and you'll likely need to do a little text pre-processing just to get started.

This can actually be used for one hot encoding if max_m is set to 1, though I'll make a more efficient version of that process in a later function. The combo column becomes superfluous in this case.

If you don't need combinations and each cell has the same pattern of entry, you could use tidyr::separate.

I tested this against a model.matrix approach and two text-analysis approaches (see examples), and with a problem that was notably more sizeable than the examples. Using model.matrix wasn't viable with even that size, and surprisingly, a simple tidytext approach was consistently fastest. However, this implementation is parallelizable in two parts, and requires nothing beyond what comes with a base R installation, so it wins.

Examples

library(tidyext) d = data.frame(id = 1:4, labs = c('A/B', 'B/C/D/E', 'A/E', 'D/E')) test = combn_2_col(data = d, var = 'labs', max_m = 3) test
#> id labs #> 1 1 A/B #> 2 2 B/C/D/E #> 3 3 A/E #> 4 4 D/E #> combo #> 1 A, B, A_B #> 2 B, C, D, E, B_C, B_D, B_E, C_D, C_E, D_E, B_C_D, B_C_E, B_D_E, C_D_E, B_C_D, B_C_E, B_D_E, C_D_E #> 3 A, E, A_E #> 4 D, E, D_E #> A B A_B C D E B_C B_D B_E C_D C_E D_E B_C_D #> 1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE #> 2 FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE #> 3 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE #> 4 FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE #> B_C_E B_D_E C_D_E A_E #> 1 FALSE FALSE FALSE FALSE #> 2 TRUE TRUE TRUE FALSE #> 3 FALSE FALSE FALSE TRUE #> 4 FALSE FALSE FALSE FALSE
str(test)
#> 'data.frame': 4 obs. of 20 variables: #> $ id : int 1 2 3 4 #> $ labs : chr "A/B" "B/C/D/E" "A/E" "D/E" #> $ combo:List of 4 #> ..$ : chr "A" "B" "A_B" #> ..$ : chr "B" "C" "D" "E" ... #> ..$ : chr "A" "E" "A_E" #> ..$ : chr "D" "E" "D_E" #> $ A : logi TRUE FALSE TRUE FALSE #> $ B : logi TRUE TRUE FALSE FALSE #> $ A_B : logi TRUE FALSE FALSE FALSE #> $ C : logi FALSE TRUE FALSE FALSE #> $ D : logi FALSE TRUE FALSE TRUE #> $ E : logi FALSE TRUE TRUE TRUE #> $ B_C : logi FALSE TRUE FALSE FALSE #> $ B_D : logi FALSE TRUE FALSE FALSE #> $ B_E : logi FALSE TRUE FALSE FALSE #> $ C_D : logi FALSE TRUE FALSE FALSE #> $ C_E : logi FALSE TRUE FALSE FALSE #> $ D_E : logi FALSE TRUE FALSE TRUE #> $ B_C_D: logi FALSE TRUE FALSE FALSE #> $ B_C_E: logi FALSE TRUE FALSE FALSE #> $ B_D_E: logi FALSE TRUE FALSE FALSE #> $ C_D_E: logi FALSE TRUE FALSE FALSE #> $ A_E : logi FALSE FALSE TRUE FALSE
d$labs = c('A B', 'B C D E', 'A E', 'D E') combn_2_col(data = d, var = 'labs', max_m = 1)
#> id labs combo A B C #> 1 1 A B A, B, A, B TRUE TRUE FALSE #> 2 2 B C D E B, C, D, E, B, C, D, E, B, C, D, E, B, C, D, E FALSE TRUE TRUE #> 3 3 A E A, E, A, E TRUE FALSE FALSE #> 4 4 D E D, E, D, E FALSE FALSE FALSE #> D E #> 1 FALSE FALSE #> 2 TRUE TRUE #> 3 FALSE TRUE #> 4 TRUE TRUE
d$labs = c('Tom, Dick & Harriet', "J'Sean", "OBG, Andreas", NA) combn_2_col( data = d, var = 'labs', sep = ',', max_m = 2, collapse = '-' )
#> id labs combo Tom #> 1 1 Tom, Dick & Harriet Tom, Dick & Harriet, Tom- Dick & Harriet TRUE #> 2 2 J'Sean J'Sean FALSE #> 3 3 OBG, Andreas OBG, Andreas, OBG- Andreas FALSE #> 4 4 <NA> NA FALSE #> Dick & Harriet Tom- Dick & Harriet J'Sean OBG Andreas OBG- Andreas NA #> 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE #> 2 FALSE FALSE TRUE FALSE FALSE FALSE FALSE #> 3 FALSE FALSE FALSE TRUE TRUE TRUE FALSE #> 4 FALSE FALSE FALSE FALSE FALSE FALSE TRUE