Summarize data by groups

Provide common numeric summary information.

num_by(data, main_var, group_var, digits = 1, extra = FALSE)

cat_by(
  data,
  main_var,
  group_var,
  digits = FALSE,
  perc_by_group = TRUE,
  sort_by_group = TRUE
)

Arguments

data	data frame
main_var	the variable to be summarized. Multiple variables are supplied using the vars() function.
group_var	the (optional) grouping variable.
digits	Optional rounding. Default is 1 for num_by and FALSE for cat_by.
extra	See num_summary.
perc_by_group	when supplied a grouping variable for cat_by, should within-group percentages also be returned? Default is TRUE.
sort_by_group	when supplied a grouping variable for cat_by, should the result be sorted on the grouping variable? Default is TRUE.

Value

data.frame/tibble with the corresponding summary statistics

Details

The num_by function takes a numeric variable from a data frame and provides sample size, mean, standard deviation, min, first quartile, median, third quartile, max, and the percentage of missing values, possibly over a grouping variable.
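As a rough illustration of the statistics listed above, the following base-R sketch computes them for a single numeric vector. It is not the tidyext implementation: the helper name summarize_numeric is hypothetical, and counting only non-missing values in N is an assumption.

```r
# Illustrative sketch only; summarize_numeric is a hypothetical helper,
# not part of tidyext.
summarize_numeric <- function(x, digits = 1) {
  # Quartiles and median, ignoring missing values
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  stats <- c(
    N           = sum(!is.na(x)),        # assumption: N counts non-missing values
    Mean        = mean(x, na.rm = TRUE),
    SD          = sd(x, na.rm = TRUE),
    Min         = min(x, na.rm = TRUE),
    Q1          = q[1],
    Median      = q[2],
    Q3          = q[3],
    Max         = max(x, na.rm = TRUE),
    Missing     = sum(is.na(x)),         # count of missing values
    Pct_Missing = 100 * mean(is.na(x))   # share of missing values, in percent
  )
  round(stats, digits)
}

x <- c(2, 4, 4, 5, NA, 7, 9)
res <- summarize_numeric(x)
print(res)
```

num_by returns these kinds of statistics as a tibble row per variable (and per group), rather than as a named vector.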

It follows the dplyr style of unquoted (bare) variable names; wrap them in vars() when there is more than one variable. If a grouping variable is used, missing values in it are treated as a separate group.

For cat_by, frequencies and percentages (out of the total or within group_var) are returned. Warnings are given if any of the main variables has more than 10 levels, or if the data appear to be non-categorical. The group_var argument essentially serves to provide percentages based on a focal grouping variable out of all those supplied. Missing values are treated as an additional unique value.
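The difference between the two kinds of percentage can be sketched in base R with table() and prop.table(). This is only an illustration with made-up vectors named after the example columns, not how cat_by is implemented; useNA = "ifany" keeps missing values as their own category.

```r
# Illustrative sketch only; the data are made up for this example.
g2 <- c(1, 1, 1, 2, 2, NA)               # grouping variable with a missing value
g1 <- c("a", "b", "a", "a", "a", "b")    # categorical variable to summarize

# Cross-tabulate, keeping NA as an additional category
counts <- table(g2 = g2, g1 = g1, useNA = "ifany")

# Percentages out of the overall total
pct_total <- round(100 * prop.table(counts), 1)

# Percentages within each level of the grouping variable (row-wise)
pct_group <- round(100 * prop.table(counts, margin = 1), 1)

print(counts)
print(pct_total)
print(pct_group)
```

Each row of pct_group sums to 100, whereas pct_total sums to 100 over the whole table.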

Missing values are treated as separate groups, as it's often useful to explore the nature of the missingness. To avoid this, just use na.omit or tidyr::drop_na on the data frame first.
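A minimal sketch of removing missing values before summarizing, using a small made-up data frame. The num_by call at the end uses the signature shown in the usage section and only runs if tidyext is installed.

```r
# Illustrative sketch only; df is made up for this example.
df <- data.frame(
  g = c("x", "x", NA, "y", "y"),
  a = c(1.5, NA, 3.0, 4.5, 6.0)
)

# Drop every row that has a missing value in any column
complete <- na.omit(df)

# Alternatively, drop only rows where the grouping variable is missing
grouped_only <- df[!is.na(df$g), ]

print(nrow(complete))
print(nrow(grouped_only))

# With the rows removed, no separate NA group appears in the summary
if (requireNamespace("tidyext", quietly = TRUE)) {
  print(tidyext::num_by(complete, main_var = a, group_var = g))
}
```

Note that na.omit drops a row if any column is missing, which can remove more data than intended when only the grouping variable matters.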

Examples

library(tidyext)
df1 <- data.frame(g1 = factor(sample(1:2, 50, replace = TRUE), labels=c('a','b')),
                  g2 = sample(1:4, 50, replace = TRUE),
                  a = rnorm(50),
                  b = rpois(50, 10),
                  c = sample(letters, 50, replace = TRUE),
                  d = sample(c(TRUE, FALSE), 50, replace = TRUE)
                 )


num_by(df1, main_var = a)
#> # A tibble: 1 x 10
#>   Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1 a           50  -0.1   1.1  -2.2  -0.9   -0.2   0.6     2           0
num_by(df1, main_var = a, group_var = g2, digits = 2)
#> Adding missing grouping variables: `g2`
#> Adding missing grouping variables: `g2`
#> # A tibble: 4 x 11
#> # Groups:   g2 [4]
#>      g2 Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <int> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1     4 a           14 -0.17  0.87 -1.64 -0.82  -0.28  0.51  1.2            0
#> 2     3 a           10 -0.24  1.35 -2.18 -1.23  -0.1   0.4   1.74           0
#> 3     2 a           13  0.15  1.29 -2.12 -0.78   0.35  0.94  2.04           0
#> 4     1 a           13 -0.21  1.04 -1.87 -0.78  -0.22  0.39  1.96           0

num_by(df1, main_var = dplyr::vars(a,b), group_var = g1, digits=1)
#> Adding missing grouping variables: `g1`
#> Adding missing grouping variables: `g1`
#> # A tibble: 4 x 11
#> # Groups:   g1 [2]
#>   g1    Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <fct> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1 b     a           19  -0.6   1.1  -2.2  -1.6   -0.8   0.1   1.6           0
#> 2 b     b           19   9.9   2.8   5     8.5   10    13    13             0
#> 3 a     a           31   0.2   1    -1.6  -0.7    0.3   0.7   2             0
#> 4 a     b           31  11.2   3.5   6     9     11    14.5  18             0

cat_by(df1, main_var = g1, group_var = g2, digits=1)
#> `mutate_if()` ignored the following grouping variables:
#> Column `g2`
#> # A tibble: 8 x 5
#> # Groups:   g2 [4]
#>      g2 g1        N `% of Total` `% of g2`
#>   <int> <fct> <dbl>        <dbl>     <dbl>
#> 1     1 a         8           16      61.5
#> 2     1 b         5           10      38.5
#> 3     2 a         9           18      69.2
#> 4     2 b         4            8      30.8
#> 5     3 a         5           10      50  
#> 6     3 b         5           10      50  
#> 7     4 a         9           18      64.3
#> 8     4 b         5           10      35.7
cat_by(df1, main_var = dplyr::vars(g1,d), group_var = g2, perc_by_group = FALSE)
#> # A tibble: 15 x 5
#>       g2 g1    d         N `% of Total`
#>    <int> <fct> <lgl> <int>        <dbl>
#>  1     1 a     FALSE     5           10
#>  2     1 a     TRUE      3            6
#>  3     1 b     FALSE     1            2
#>  4     1 b     TRUE      4            8
#>  5     2 a     FALSE     5           10
#>  6     2 a     TRUE      4            8
#>  7     2 b     TRUE      4            8
#>  8     3 a     FALSE     1            2
#>  9     3 a     TRUE      4            8
#> 10     3 b     FALSE     4            8
#> 11     3 b     TRUE      1            2
#> 12     4 a     FALSE     5           10
#> 13     4 a     TRUE      4            8
#> 14     4 b     FALSE     1            2
#> 15     4 b     TRUE      4            8
