Provide common numeric summary information.
num_by(data, main_var, group_var, digits = 1, extra = FALSE)

cat_by(
  data,
  main_var,
  group_var,
  digits = FALSE,
  perc_by_group = TRUE,
  sort_by_group = TRUE
)
Argument | Description |
---|---|
data | A data frame. |
main_var | The variable to be summarized. Multiple variables are supplied using the vars() function. |
group_var | The (optional) grouping variable. |
digits | Optional rounding. Default is 1 for num_by and FALSE for cat_by. |
extra | See num_summary. |
perc_by_group | When a grouping variable is supplied to cat_by, whether to also return within-group percentages. Default is TRUE. |
sort_by_group | When a grouping variable is supplied to cat_by, whether to sort the result on the grouping variable. Default is TRUE. |
A data frame or tibble with the corresponding summary statistics.
The num_by function takes a numeric variable from a data frame and provides the sample size, mean, standard deviation, min, first quartile, median, third quartile, max, and percentage of missing values, optionally within levels of a grouping variable. It works in the dplyr style with unquoted (bare) variable names; wrap them in vars() if there is more than one variable. If a grouping variable is used, missing values in it are treated as a separate group.
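As a rough illustration of the statistics listed above, the following base R sketch computes them per group and keeps missing group values as their own group. The helper summarise_numeric and the "(missing)" label are hypothetical names used only for this example, and the choice to count N as the number of non-missing values is an assumption; this is not the package's implementation.

```r
# Hypothetical helper (not part of tidyext): summary statistics for one
# numeric vector, ignoring NAs except when reporting the missing percentage.
summarise_numeric <- function(x, digits = 1) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  out <- c(
    N           = sum(!is.na(x)),        # assumption: N counts non-missing values
    Mean        = mean(x, na.rm = TRUE),
    SD          = sd(x, na.rm = TRUE),
    Min         = min(x, na.rm = TRUE),
    Q1          = q[1],
    Median      = q[2],
    Q3          = q[3],
    Max         = max(x, na.rm = TRUE),
    `% Missing` = 100 * mean(is.na(x))
  )
  round(out, digits)
}

df <- data.frame(
  g = c("x", "x", "y", "y", NA, NA),
  a = c(1, 2, 3, NA, 5, 7)
)

# Relabel NA so that split() keeps those rows as a separate group
# instead of silently dropping them.
group <- ifelse(is.na(df$g), "(missing)", as.character(df$g))

# One row per group, one column per statistic.
result <- t(sapply(split(df$a, group), summarise_numeric))
print(result)
```

With real data, num_by(df, main_var = a, group_var = g) is the call this sketch is meant to approximate.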
For cat_by, frequencies and percentages (out of the total or within group_var) are returned. Warnings are given if any of the main variables has more than 10 levels, or if the data appears to be non-categorical. The group_var argument essentially just provides percentages based on a focal grouping variable out of all those supplied. Missing values are treated as an additional unique value.
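The two kinds of percentage can be sketched in base R with table() and prop.table(). This is only an illustration of the idea under the assumption that "% of total" divides by all rows and "% of group" divides by the rows in each level of the grouping variable; the column names pct_total and pct_group are made up for the example.

```r
df <- data.frame(
  g2 = c(1, 1, 1, 2, 2, NA),
  g1 = c("a", "a", "b", "a", "b", "b")
)

# useNA = "ifany" keeps NA as its own category rather than dropping it.
counts <- table(g2 = df$g2, g1 = df$g1, useNA = "ifany")

# Percentage of all rows, and percentage within each level of g2 (margin = 1).
pct_total <- round(100 * prop.table(counts), 1)
pct_group <- round(100 * prop.table(counts, margin = 1), 1)

# Long format: one row per observed combination of g2 and g1.
out <- as.data.frame(counts, responseName = "N")
out$pct_total <- as.vector(pct_total)
out$pct_group <- as.vector(pct_group)
out <- out[out$N > 0, ]
print(out)
```

Setting perc_by_group = FALSE in cat_by corresponds to keeping only the total-based percentage.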
Missing values are treated as separate groups, as it's often useful to explore the nature of the missingness. To avoid this, just use na.omit or tidyr::drop_na on the data frame first.
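A minimal base R sketch of dropping missing values before summarizing is shown here; the data frame and object names are illustrative only. Note the difference between removing every incomplete row and removing only rows where the grouping variable is missing.

```r
df <- data.frame(
  g = c("x", "y", NA, "x"),
  a = c(1.2, NA, 3.4, 5.6)
)

# na.omit() drops every row that has an NA in any column.
complete_rows <- na.omit(df)

# To drop rows only when the grouping variable is missing, subset on it.
# tidyr::drop_na(df, g) is the tidyverse equivalent of this line.
known_group <- df[!is.na(df$g), ]

print(nrow(complete_rows))
print(nrow(known_group))
```

Either result can then be passed as the data argument to num_by or cat_by.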
describe_all
library(tidyext)

df1 <- data.frame(
  g1 = factor(sample(1:2, 50, replace = TRUE), labels = c('a', 'b')),
  g2 = sample(1:4, 50, replace = TRUE),
  a = rnorm(50),
  b = rpois(50, 10),
  c = sample(letters, 50, replace = TRUE),
  d = sample(c(TRUE, FALSE), 50, replace = TRUE)
)

num_by(df1, main_var = a)
#> # A tibble: 1 x 10
#>   Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1 a           50  -0.1   1.1  -2.2  -0.9   -0.2   0.6     2           0

num_by(df1, main_var = a, group_var = g2, digits = 2)
#> # A tibble: 4 x 11
#> # Groups:   g2 [4]
#>      g2 Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <int> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1     4 a           14 -0.17  0.87 -1.64 -0.82  -0.28  0.51  1.2            0
#> 2     3 a           10 -0.24  1.35 -2.18 -1.23  -0.1   0.4   1.74           0
#> 3     2 a           13  0.15  1.29 -2.12 -0.78   0.35  0.94  2.04           0
#> 4     1 a           13 -0.21  1.04 -1.87 -0.78  -0.22  0.39  1.96           0
#>
#> # A tibble: 4 x 11
#> # Groups:   g1 [2]
#>   g1    Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <fct> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1 b     a           19  -0.6   1.1  -2.2  -1.6   -0.8   0.1   1.6           0
#> 2 b     b           19   9.9   2.8   5     8.5   10    13    13             0
#> 3 a     a           31   0.2   1    -1.6  -0.7    0.3   0.7   2             0
#> 4 a     b           31  11.2   3.5   6     9     11    14.5  18             0

cat_by(df1, main_var = g1, group_var = g2, digits = 1)
#> # A tibble: 8 x 5
#> # Groups:   g2 [4]
#>      g2 g1        N `% of Total` `% of g2`
#>   <int> <fct> <dbl>        <dbl>     <dbl>
#> 1     1 a         8           16      61.5
#> 2     1 b         5           10      38.5
#> 3     2 a         9           18      69.2
#> 4     2 b         4            8      30.8
#> 5     3 a         5           10      50
#> 6     3 b         5           10      50
#> 7     4 a         9           18      64.3
#> 8     4 b         5           10      35.7
#>
#> # A tibble: 15 x 5
#>       g2 g1    d         N `% of Total`
#>    <int> <fct> <lgl> <int>        <dbl>
#>  1     1 a     FALSE     5           10
#>  2     1 a     TRUE      3            6
#>  3     1 b     FALSE     1            2
#>  4     1 b     TRUE      4            8
#>  5     2 a     FALSE     5           10
#>  6     2 a     TRUE      4            8
#>  7     2 b     TRUE      4            8
#>  8     3 a     FALSE     1            2
#>  9     3 a     TRUE      4            8
#> 10     3 b     FALSE     4            8
#> 11     3 b     TRUE      1            2
#> 12     4 a     FALSE     5           10
#> 13     4 a     TRUE      4            8
#> 14     4 b     FALSE     1            2
#> 15     4 b     TRUE      4            8