Add indicators for all desired variables in a data set.

onehot(
  data,
  var = NULL,
  nas = "na.pass",
  sparse = FALSE,
  keep.original = FALSE
)

Arguments

data

A data frame

var

A character vector of variable names to encode. If NULL (the default), all character and factor variables are encoded.

nas

How to handle missing values. With na.omit or na.exclude, any observations with missing data are removed from the result. With na.pass (the default), the result retains the missing values. With na.fail, an error is thrown.

sparse

Logical (default FALSE). If TRUE, returns only the encoded variables, as a sparse matrix.

keep.original

Logical (default FALSE). Whether to keep the original variables alongside the indicators. Not available when sparse is TRUE.
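
The values accepted by nas share their names with the missing-value handlers in base R's stats package. As a rough illustration, and assuming (this page does not confirm it) that onehot treats incomplete rows the way those base functions do, the sketch below applies each handler to a small made-up data frame. The data frame df is hypothetical.

```r
# Toy data frame (hypothetical): rows 2 and 3 each contain a missing value.
df <- data.frame(x = c(1, NA, 3), g = c("a", "b", NA))

kept    <- na.pass(df)     # returns the data unchanged, NAs included
omitted <- na.omit(df)     # drops every row that contains an NA
excl    <- na.exclude(df)  # drops the same rows; it differs only in how
                           # some modelling functions later pad their output

# na.fail() signals an error when any value is missing.
failed <- tryCatch(
  {
    na.fail(df)
    FALSE
  },
  error = function(e) TRUE
)

nrow(kept)
nrow(omitted)
nrow(excl)
failed
```

The examples at the end of this page show the same contrast on iris2: the default keeps all 150 rows, while nas = 'na.omit' returns 125.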

Value

A data frame containing the encoded variables or, if sparse is TRUE, a sparse matrix of only the encoded variables.
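
For readers unfamiliar with the sparse return type, the sketch below builds a comparable indicator matrix with sparse.model.matrix() from the Matrix package. It is an illustration only: whether onehot calls this function internally is an assumption, and the column names produced here differ from those onehot returns.

```r
library(Matrix)

# One indicator column per level of Species. The "0 +" term drops the
# intercept so that no level is absorbed as a reference category.
m <- sparse.model.matrix(~ 0 + Species, data = iris)

class(m)  # a sparse matrix class defined by the Matrix package
dim(m)    # rows = observations, columns = levels of Species
```

A sparse matrix stores only the non-zero entries, which suits indicator columns because each row has a single 1 per encoded variable.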

Details

This function is a simple one-hot encoder with a couple of commonly desired options. It takes the applicable variables and creates a binary indicator column for each unique value. Variables that are neither factor nor character are coerced to character and then encoded in the same way. As requested, it can handle missing values, return a sparse matrix, or keep the original variable(s).
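
The steps described above can be sketched in base R as follows. This is a minimal sketch, not the tidyext implementation: the helpers default_vars() and onehot_sketch() are hypothetical names introduced for illustration, and the name_value column naming simply mirrors the output shown in the examples.

```r
# Hypothetical helper: names of the character and factor columns,
# i.e. what is encoded when var is NULL.
default_vars <- function(data) {
  is_cat <- vapply(
    data,
    function(col) is.character(col) || is.factor(col),
    logical(1)
  )
  names(data)[is_cat]
}

# Hypothetical helper: one binary indicator column per unique value of x.
onehot_sketch <- function(x, name) {
  x <- as.character(x)       # coerce non-character input
  values <- sort(unique(x))  # sort() drops NA, so NA gets no column
  out <- lapply(values, function(v) as.numeric(x == v))  # NA stays NA
  names(out) <- paste(name, values, sep = "_")
  data.frame(out, check.names = FALSE)
}

default_vars(iris)
enc <- onehot_sketch(iris$Species, "Species")
str(enc)
names(onehot_sketch(mtcars$cyl, "cyl"))
```

Because the values are compared as character strings, they are also sorted as strings, so the column order for numeric input with mixed digit counts may not be numeric order.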

Examples

library(tidyext)
str(onehot(iris, keep.original = TRUE))
#> 'data.frame': 150 obs. of 8 variables:
#>  $ Sepal.Length      : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width       : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length      : num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width       : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species           : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Species_setosa    : num 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Species_versicolor: num 0 0 0 0 0 0 0 0 0 0 ...
#>  $ Species_virginica : num 0 0 0 0 0 0 0 0 0 0 ...
str(onehot(iris, sparse = TRUE))
#> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   ..@ i       : int [1:150] 0 1 2 3 4 5 6 7 8 9 ...
#>   ..@ p       : int [1:4] 0 50 100 150
#>   ..@ Dim     : int [1:2] 150 3
#>   ..@ Dimnames:List of 2
#>   .. ..$ : chr [1:150] "1" "2" "3" "4" ...
#>   .. ..$ : chr [1:3] "Species_xsetosa" "Species_xversicolor" "Species_xvirginica"
#>   ..@ x       : num [1:150] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..@ factors : list()
str(onehot(mtcars, var = c('vs','cyl')))
#>
#> You have supplied numeric variables.
#> Attempts were made to keep the
#> column names consistent, but you'll want to check.
#> 'data.frame': 32 obs. of 14 variables:
#>  $ mpg  : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ disp : num 160 160 108 258 360 ...
#>  $ hp   : num 110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt   : num 2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec : num 16.5 17 18.6 19.4 17 ...
#>  $ am   : num 1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
#>  $ cyl_4: num 0 0 1 0 0 0 0 1 1 0 ...
#>  $ cyl_6: num 1 1 0 1 0 1 0 0 0 1 ...
#>  $ cyl_8: num 0 0 0 0 1 0 1 0 0 0 ...
#>  $ vs_0 : num 1 1 0 0 1 0 1 0 0 0 ...
#>  $ vs_1 : num 0 0 1 1 0 1 0 1 1 1 ...
iris2 = iris
iris2[sample(1:150, 25),] = NA
str(onehot(iris2))
#> 'data.frame': 150 obs. of 7 variables:
#>  $ Sepal.Length      : num 5.1 4.9 NA NA 5 NA 4.6 NA 4.4 4.9 ...
#>  $ Sepal.Width       : num 3.5 3 NA NA 3.6 NA 3.4 NA 2.9 3.1 ...
#>  $ Petal.Length      : num 1.4 1.4 NA NA 1.4 NA 1.4 NA 1.4 1.5 ...
#>  $ Petal.Width       : num 0.2 0.2 NA NA 0.2 NA 0.3 NA 0.2 0.1 ...
#>  $ Species_setosa    : num 1 1 NA NA 1 NA 1 NA 1 1 ...
#>  $ Species_versicolor: num 0 0 NA NA 0 NA 0 NA 0 0 ...
#>  $ Species_virginica : num 0 0 NA NA 0 NA 0 NA 0 0 ...
str(onehot(iris2, nas = 'na.omit'))
#> 'data.frame': 125 obs. of 7 variables:
#>  $ Sepal.Length      : num 5.1 4.9 5 4.6 4.4 4.9 4.8 4.8 4.3 5.8 ...
#>  $ Sepal.Width       : num 3.5 3 3.6 3.4 2.9 3.1 3.4 3 3 4 ...
#>  $ Petal.Length      : num 1.4 1.4 1.4 1.4 1.4 1.5 1.6 1.4 1.1 1.2 ...
#>  $ Petal.Width       : num 0.2 0.2 0.2 0.3 0.2 0.1 0.2 0.1 0.1 0.2 ...
#>  $ Species_setosa    : num 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Species_versicolor: num 0 0 0 0 0 0 0 0 0 0 ...
#>  $ Species_virginica : num 0 0 0 0 0 0 0 0 0 0 ...