Describe your data cleanly and effectively

Describe data sets with multiple variable types effectively.

describe_all(
  data,
  digits = 2,
  include_NAcat = TRUE,
  max_levels = 10,
  include_numeric = FALSE,
  NAcat_include = NULL,
  sort_by_freq = TRUE,
  ...
)

describe_all_num(data, digits = 2, ...)

describe_all_cat(
  data,
  digits = 2,
  include_NAcat = TRUE,
  max_levels = 10,
  include_numeric = FALSE,
  sort_by_freq = TRUE,
  as_ordered = FALSE
)

Arguments

data	The dataset, of class data.frame.
digits	Number of decimal places; see [base::round()]. Default is 2. For categorical variables, rounding is applied to the proportion (i.e. before conversion to a percentage).
include_NAcat	Include NA values as categorical levels? Default is `TRUE`.
max_levels	The maximum number of levels you want to display for categorical variables. Default is 10.
include_numeric	For the categorical summary, also include numeric variables whose number of unique values is less than or equal to `max_levels`? Default is `FALSE`.
NAcat_include	Deprecated alias of `include_NAcat`.
sort_by_freq	Sort categorical levels by frequency? Default is `TRUE`.
...	Additional arguments passed to `num_summary`
as_ordered	Return the categorical results with the levels as ordered. See details and example.
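
The note on `digits` above says rounding happens on the proportion rather than on the percentage. A minimal base R sketch of what that implies, using a made-up proportion of 1 in 7 (the variable names here are illustrative and not part of the package):

```r
# Hypothetical proportion: 1 observation out of 7
prop <- 1 / 7

# Round the proportion first, then convert to a percentage
pct_default <- round(prop, digits = 2) * 100  # proportion becomes 0.14
pct_precise <- round(prop, digits = 5) * 100  # proportion becomes 0.14286

print(pct_default)
print(pct_precise)
```

With the default of 2 the percentage is effectively a whole number (about 14), while `digits = 5` keeps about 14.286. This is presumably why the examples that want decimals in the `%` column pass `digits = 5`.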

Value

A list with two elements, `Numeric Variables` and `Categorical Variables`, holding the summaries for numeric and other variables respectively. The type-specific functions return only the contents of the corresponding element.

Details

This function comes out of my frustration with various data set summaries that were either inadequate for my needs, too 'busy' with output, or unable to deal well with mixed data types.

Numeric data is treated separately from categorical data, and its summary provides the same information as num_summary.

Categorical variables are defined as those with class 'character', 'factor', 'logical', or 'ordered', plus, when include_numeric is TRUE, numeric variables with no more than max_levels unique values. They are summarized with frequencies and percentages. For empty categorical variables (e.g. after a subset), a warning is thrown. Note that max_levels is used with top_n, and so will return additional values when there are ties.
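
To illustrate the note about ties, here is a small sketch of how `dplyr::top_n` behaves on a made-up frequency table. It assumes the function applies `top_n` to per-variable counts roughly like this; the `counts` data and the column names are illustrative and are not the package's internal code.

```r
library(dplyr)

# Hypothetical frequency table for one categorical variable
counts <- data.frame(
  Group     = c("a", "b", "c", "d"),
  Frequency = c(5, 3, 3, 1)
)

# Ask for the top 2 levels by frequency; "b" and "c" tie for second place
top_levels <- top_n(counts, 2, Frequency)

print(top_levels)
```

Because the second and third levels tie, three rows come back even though two were requested. The same effect can make a summary show more than max_levels rows for a variable.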

The as_ordered argument is to get around the notorious alphabetical ordering of ggplot. It returns a data.frame where the 'data' column contains the frequency information of the categorical levels, while leaving the levels in order (e.g. decreasing if sort_by_freq was TRUE). This way you can directly plot the result in the manner you've actually requested. See the example.
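
The ordering issue can be seen in base R alone. The sketch below uses made-up group labels and is only an illustration of why level order matters, not the package's implementation: a factor built without explicit levels sorts them, whereas supplying the levels keeps the order given, which is the order a discrete plot axis would then follow.

```r
# Hypothetical groups, already in decreasing order of frequency
groups <- c("8", "4", "6")

# Default factor creation sorts the levels
alphabetical <- factor(groups)

# Supplying the levels explicitly preserves the given order
as_given <- factor(groups, levels = groups)

print(levels(alphabetical))
print(levels(as_given))
```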

The functions describe_all_num and describe_all_cat will provide only numeric or only categorical data summaries respectively. describeAll is a deprecated alias.

Examples

library(tidyext); library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
X = data.frame(f1 =gl(2, 1, 20, labels=c('A', 'B')),
               f2=gl(2, 2, 20, labels=c('X', 'Q')))
X = X %>% mutate(bin1 = rbinom(20, 1, prob = .5),
                 logic1 = sample(c(TRUE, FALSE), 20, replace = TRUE),
                 num1 = rnorm(20),
                 num2 = rpois(20, 5),
                 char1 = sample(letters, 20, replace = TRUE))
describe_all(X)
#> $`Numeric Variables`
#> # A tibble: 3 x 10
#>   Variable     N  Mean    SD   Min    Q1 Median    Q3   Max `% Missing`
#>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>
#> 1 bin1        20  0.3   0.47  0     0      0     1     1              0
#> 2 num1        20  0.01  0.99 -1.91 -0.44   0.05  0.54  1.89           0
#> 3 num2        20  4.95  2.42  1     3      5     6    12              0
#> 
#> $`Categorical Variables`
#> # A tibble: 19 x 4
#>    Variable Group Frequency   `%`
#>    <chr>    <fct>     <int> <dbl>
#>  1 f1       A            10    50
#>  2 f1       B            10    50
#>  3 f2       Q            10    50
#>  4 f2       X            10    50
#>  5 logic1   TRUE         12    60
#>  6 logic1   FALSE         8    40
#>  7 char1    a             2    10
#>  8 char1    e             2    10
#>  9 char1    f             2    10
#> 10 char1    g             2    10
#> 11 char1    l             2    10
#> 12 char1    m             2    10
#> 13 char1    q             2    10
#> 14 char1    b             1     5
#> 15 char1    n             1     5
#> 16 char1    r             1     5
#> 17 char1    s             1     5
#> 18 char1    v             1     5
#> 19 char1    z             1     5
#> 

describe_all(data.frame(x=factor(1:7)), digits=5)
#> No numeric data.
#> $`Numeric Variables`
#> NULL
#> 
#> $`Categorical Variables`
#> # A tibble: 7 x 4
#>   Variable Group Frequency   `%`
#>   <chr>    <fct>     <int> <dbl>
#> 1 x        1             1  14.3
#> 2 x        2             1  14.3
#> 3 x        3             1  14.3
#> 4 x        4             1  14.3
#> 5 x        5             1  14.3
#> 6 x        6             1  14.3
#> 7 x        7             1  14.3
#> 
describe_all(mtcars, digits=5, include_numeric=TRUE, max_levels=3)
#> $`Numeric Variables`
#> # A tibble: 11 x 10
#>    Variable     N    Mean      SD   Min     Q1 Median     Q3    Max `% Missing`
#>    <chr>    <dbl>   <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>       <dbl>
#>  1 mpg         32  20.1     6.03  10.4   15.4   19.2   22.8   33.9            0
#>  2 cyl         32   6.19    1.79   4      4      6      8      8              0
#>  3 disp        32 231.    124.    71.1  121.   196.   326    472              0
#>  4 hp          32 147.     68.6   52     96.5  123    180    335              0
#>  5 drat        32   3.60    0.535  2.76   3.08   3.70   3.92   4.93           0
#>  6 wt          32   3.22    0.978  1.51   2.58   3.32   3.61   5.42           0
#>  7 qsec        32  17.8     1.79  14.5   16.9   17.7   18.9   22.9            0
#>  8 vs          32   0.438   0.504  0      0      0      1      1              0
#>  9 am          32   0.406   0.499  0      0      0      1      1              0
#> 10 gear        32   3.69    0.738  3      3      4      4      5              0
#> 11 carb        32   2.81    1.62   1      2      2      4      8              0
#> 
#> $`Categorical Variables`
#> # A tibble: 10 x 4
#>    Variable Group Frequency   `%`
#>    <chr>    <fct>     <int> <dbl>
#>  1 cyl      8            14  43.8
#>  2 cyl      4            11  34.4
#>  3 cyl      6             7  21.9
#>  4 vs       0            18  56.2
#>  5 vs       1            14  43.8
#>  6 am       0            19  59.4
#>  7 am       1            13  40.6
#>  8 gear     3            15  46.9
#>  9 gear     4            12  37.5
#> 10 gear     5             5  15.6
#> 

library(ggplot2)
freqs = describe_all_cat(mtcars,
                         include_numeric=TRUE,
                         max_levels=3,
                         as_ordered = TRUE)
freqs %>%
  filter(Variable == 'cyl') %>%
  tidyr::unnest(cols = c(data)) %>%  # name the list column explicitly, as current tidyr requires
  ggplot(aes(x=Group, y=`%`)) +
  geom_point(size = 10)
