Provide common numeric summary information.

num_by(data, main_var, group_var, digits = 1, extra = FALSE)

cat_by(
  data,
  main_var,
  group_var,
  digits = FALSE,
  perc_by_group = TRUE,
  sort_by_group = TRUE
)

Arguments

data

data frame

main_var

the variable to be summarized. Multiple variables are supplied using the vars() function.

group_var

the (optional) grouping variable.

digits

Optional rounding. Default is 1 for num_by and FALSE for cat_by.

extra

See num_summary.

perc_by_group

when supplied a grouping variable for cat_by, do you want within-group percentages as well? Default is TRUE.

sort_by_group

when supplied a grouping variable for cat_by, do you want the result sorted on the grouping variable? Default is TRUE.

Value

data.frame/tibble with the corresponding summary statistics

Details

The num_by function takes a numeric variable from a data frame and provides the sample size, mean, standard deviation, minimum, first quartile, median, third quartile, maximum, and percentage of missing values, optionally over a grouping variable.
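As a rough illustration of what these statistics amount to, the following base R sketch computes them for a single numeric vector. The helper name num_summary_sketch is hypothetical and is not part of tidyext; it is not the package's implementation. It assumes that N counts non-missing values and, following the `% Missing` column in the example output on this page, reports missingness as a percentage.

```r
# Hypothetical helper for illustration only; not the tidyext implementation.
# Assumption: N counts non-missing values, and missingness is a percentage.
num_summary_sketch <- function(x, digits = 1) {
  stopifnot(is.numeric(x))
  # Quartiles and median with missing values removed
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  out <- data.frame(
    N          = sum(!is.na(x)),
    Mean       = mean(x, na.rm = TRUE),
    SD         = sd(x, na.rm = TRUE),
    Min        = min(x, na.rm = TRUE),
    Q1         = q[1],
    Median     = q[2],
    Q3         = q[3],
    Max        = max(x, na.rm = TRUE),
    PctMissing = 100 * mean(is.na(x))
  )
  # All columns are numeric, so the whole data frame can be rounded at once
  round(out, digits)
}

x <- c(2, 4, 4, 5, 7, NA)
res <- num_summary_sketch(x)
res
```

A grouped version would apply the same helper to each subset of the data, for example via split() and lapply().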

It works in the dplyr style using unquoted (bare) variable names, using the vars() function if there is more than one variable. If using a grouping variable, it will treat missing values as a separate group.

For cat_by, frequencies and percentages (out of the total or within group_var) are returned. Warnings are given if any of the main variables has more than 10 levels, or if the data appear to be non-categorical. The group_var argument is essentially just to provide percentages based on a focal grouping variable out of all those supplied. Missing values are treated as an additional unique value.
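The two kinds of percentage can be sketched in base R as follows. This is only an approximation of the idea, not how cat_by is implemented; the toy data and the column names pct_total and pct_group are made up for illustration. useNA = "ifany" keeps missing values as their own category, mirroring the behavior described above.

```r
# Toy data, invented for illustration
df <- data.frame(
  grp = c(1, 1, 1, 2, 2),
  cat = c("a", "a", "b", "a", NA)
)

# Cross-tabulate, keeping NA as its own category
counts <- as.data.frame(
  table(grp = df$grp, cat = df$cat, useNA = "ifany"),
  responseName = "N"
)

# Percentage out of all observations
counts$pct_total <- 100 * counts$N / sum(counts$N)

# Percentage within each level of the grouping variable
counts$pct_group <- 100 * counts$N / ave(counts$N, counts$grp, FUN = sum)

counts
```

The within-group percentages sum to 100 inside each level of grp, while the overall percentages sum to 100 across the whole table.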

Missing values are treated as separate groups, as it's often useful to explore the nature of the missingness. To avoid this, just use na.omit or tidyr::drop_na on the data frame first.
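A small base R sketch of removing missing values before summarizing is shown below. The data frame and its column names are invented for illustration. Note that na.omit drops a row if any column is missing, which may remove more rows than intended when only the grouping variable matters.

```r
# Toy data, invented for illustration
df <- data.frame(
  g = c("a", "b", NA, "a"),
  x = c(1.5, NA, 3, 4)
)

# Drop every row that has a missing value in any column
complete <- na.omit(df)
nrow(complete)

# Drop only the rows where the grouping variable itself is missing
df_g <- df[!is.na(df$g), ]
nrow(df_g)
```

Either result can then be passed as the data argument in place of the original data frame.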

See also

describe_all

Examples

library(tidyext)

df1 <- data.frame(
  g1 = factor(sample(1:2, 50, replace = TRUE), labels = c('a', 'b')),
  g2 = sample(1:4, 50, replace = TRUE),
  a  = rnorm(50),
  b  = rpois(50, 10),
  c  = sample(letters, 50, replace = TRUE),
  d  = sample(c(TRUE, FALSE), 50, replace = TRUE)
)

num_by(df1, main_var = a)
#> # A tibble: 1 x 10
#>   Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1 a           50  -0.1   1.1  -2.2  -0.9   -0.2   0.6     2           0

num_by(df1, main_var = a, group_var = g2, digits = 2)
#> Adding missing grouping variables: `g2`
#> Adding missing grouping variables: `g2`
#> # A tibble: 4 x 11
#> # Groups:   g2 [4]
#>      g2 Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <int> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1     4 a           14 -0.17  0.87 -1.64 -0.82  -0.28  0.51   1.2           0
#> 2     3 a           10 -0.24  1.35 -2.18 -1.23   -0.1   0.4  1.74           0
#> 3     2 a           13  0.15  1.29 -2.12 -0.78   0.35  0.94  2.04           0
#> 4     1 a           13 -0.21  1.04 -1.87 -0.78  -0.22  0.39  1.96           0

num_by(df1, main_var = dplyr::vars(a,b), group_var = g1, digits=1)
#> Adding missing grouping variables: `g1`
#> Adding missing grouping variables: `g1`
#> # A tibble: 4 x 11
#> # Groups:   g1 [2]
#>   g1    Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <fct> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1 b     a           19  -0.6   1.1  -2.2  -1.6   -0.8   0.1   1.6           0
#> 2 b     b           19   9.9   2.8     5   8.5     10    13    13           0
#> 3 a     a           31   0.2     1  -1.6  -0.7    0.3   0.7     2           0
#> 4 a     b           31  11.2   3.5     6     9     11  14.5    18           0

cat_by(df1, main_var = g1, group_var = g2, digits=1)
#> `mutate_if()` ignored the following grouping variables:
#> Column `g2`
#> # A tibble: 8 x 5
#> # Groups:   g2 [4]
#>      g2 g1        N `% of Total` `% of g2`
#>   <int> <fct> <dbl>        <dbl>     <dbl>
#> 1     1 a         8           16      61.5
#> 2     1 b         5           10      38.5
#> 3     2 a         9           18      69.2
#> 4     2 b         4            8      30.8
#> 5     3 a         5           10        50
#> 6     3 b         5           10        50
#> 7     4 a         9           18      64.3
#> 8     4 b         5           10      35.7

cat_by(df1, main_var = dplyr::vars(g1,d), group_var = g2, perc_by_group = FALSE)
#> # A tibble: 15 x 5
#>       g2 g1    d         N `% of Total`
#>    <int> <fct> <lgl> <int>        <dbl>
#>  1     1 a     FALSE     5           10
#>  2     1 a     TRUE      3            6
#>  3     1 b     FALSE     1            2
#>  4     1 b     TRUE      4            8
#>  5     2 a     FALSE     5           10
#>  6     2 a     TRUE      4            8
#>  7     2 b     TRUE      4            8
#>  8     3 a     FALSE     1            2
#>  9     3 a     TRUE      4            8
#> 10     3 b     FALSE     4            8
#> 11     3 b     TRUE      1            2
#> 12     4 a     FALSE     5           10
#> 13     4 a     TRUE      4            8
#> 14     4 b     FALSE     1            2
#> 15     4 b     TRUE      4            8