A Demonstration of Tools
In R many tools are available to help you dive in and explore your data. However, in consulting I still see a lot of people using base R's table and summary functions, followed by a lot of work to get the result into a more presentable format. My own frustrations led me to create a package (tidyext) for personal use in this area. While that suits me fine for most of my needs, there are tools that can go much further with little effort. Recently, Staniak and Biecek (2019) wrote an article in the R Journal exploring several such packages, so I thought I'd try them out for myself, and take others along with me for that ride.
Note that these are first impressions: I haven't dived deeply into any of the packages, and may be missing some key features (apologies to the package authors!). But that is also part of the point for this sort of thing. These aren't modeling packages, and we have a good idea of what we want in EDA, so they should be easy to pick up and use.
I've updated the article's Table 1, adding roughly a year's worth of downloads.
package | downloads | debut |
---|---|---|
summarytools | 198878 | 2014-08-11 |
DataExplorer | 186078 | 2016-03-01 |
visdat | 184528 | 2017-07-11 |
funModeling | 101996 | 2016-02-07 |
arsenal | 76017 | 2016-12-30 |
dlookr | 44393 | 2018-04-27 |
dataMaid | 42749 | 2017-01-02 |
inspectdf | 23246 | 2019-04-24 |
xray | 19376 | 2017-11-22 |
RtutoR | 16251 | 2016-03-12 |
ExPanDaR | 15423 | 2018-05-11 |
exploreR | 14433 | 2016-02-10 |
SmartEDA | 13641 | 2018-04-06 |
explore | 10090 | 2019-05-16 |
A more informative assessment of usage is average monthly downloads.
package | average_monthly_downloads |
---|---|
visdat | 5125.778 |
DataExplorer | 3578.423 |
summarytools | 2801.099 |
funModeling | 1924.453 |
arsenal | 1767.837 |
dlookr | 1644.185 |
inspectdf | 1549.733 |
dataMaid | 1017.833 |
explore | 720.714 |
xray | 605.500 |
ExPanDaR | 593.192 |
SmartEDA | 505.222 |
RtutoR | 312.519 |
exploreR | 272.321 |
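For reference, these counts come straight from the CRAN logs, and something like the following with the cranlogs package would reproduce or update them (the dates and packages here are just illustrative).

library(cranlogs)
library(dplyr)

downloads = cran_downloads(
  packages = c('visdat', 'DataExplorer', 'summarytools'),
  from     = '2014-08-11',   # the earliest debut above
  to       = Sys.Date()
)

downloads %>%
  group_by(package) %>%
  summarise(
    total       = sum(count),
    avg_monthly = sum(count) / (n() / 30.4)  # naive days-to-months conversion
  )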
Here is a visualization of their growth over time.
I'll outline my reasons for selecting some packages to explore and not others. These reasons are somewhat, though not entirely, arbitrary, and may leave out some viable newer packages. For the data scenario, I am assuming messy data of the sort that might have hundreds of columns of mixed data types, potentially with lots of missingness, attributes that are only applicable to subsets of the data (e.g. branching logic in surveys), etc.1
Here are my criteria for selection:
Things I’m not as concerned about:
In the end, I would like a package that is well put together, will make common tasks easier for me, and will potentially save me time in creating reports and presentations.
Staniak & Biecek note two general phases of data exploration, each with specific tasks, based on the CRISP-DM Wirth and Hipp (2000) standard.2
For the first phase, I will focus on tools for understanding, particularly description and validity, as the authors refer to exploration tasks solely as visualization, which is a perk, but something I'm more inclined to do myself.
I'm definitely less interested in data preparation, though I will speak about it more towards the end. 'Cleaning' in the article refers to mean/median imputation, something I've never bothered to do, for reasons that have been noted in the statistical literature for a very long time. The other transformations are easy, and probably should be more explicitly documented in your code anyway. Furthermore, in creating derived attributes, things like category merging and standardization depend on the subset of the data used, so it would probably be better to be explicit than automatic. Also, if something like an automated PCA is viable for your situation, you probably have very simple data (i.e. all variables of the same type), in which case most of these tools won't add much value to you anyway.
So with that in mind here are the ones I will explore (in alphabetical order):
I've chosen the heart disease data originally available from the UCI repository (object name hd). It contains a mixture of data types but isn't too unwieldy, as it's already been cleaned and has few columns (it's from my noiris package). For our purposes, I've additionally added some random missingness, and created an hd_sample object which has only a couple columns to cut down on the display. The full RData file for hd can be found here.
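For the curious, the missingness was added at random; a minimal sketch of that sort of thing follows (these are not the exact proportions or seed used).

# set a random sample of each column's values to NA
add_missing = function(x, prop = .05) {
  x[sample(length(x), size = floor(prop * length(x)))] = NA
  x
}

set.seed(1234)  # arbitrary

hd = hd %>%
  mutate_all(add_missing)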
For data description, we are interested in things like the dimensions of the data, variable types, and maybe even meta-data, like the object’s size in RAM. I also add univariate data summaries to this list, as it’s something you’d always want to know and usually report, though these are thought of as data exploration in the article’s delineation.
To give a sense of what my preferences are, consider my own functions. The following actually calls a separate numerical summary function as well as a categorical variable function, and both return ‘tidy’ data frames that can immediately be used for presentation (e.g. via kableExtra) and visualization (e.g. ggplot2), or drilling down to only selections of the output.
library(tidyext)
describe_all(hd) # also describe_all_num, describe_all_cat
$`Numeric Variables`
# A tibble: 8 x 10
Variable N Mean SD Min Q1 Median Q3 Max `% Missing`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age 283 54.7 9 29 48 56 61 77 7
2 resting_bp 286 132. 17.5 94 120 130 140 200 6
3 cholesterol 291 247. 52.1 126 211 241 274 564 4
4 resting_ecg 287 0.53 0.53 0 0 1 1 2 5
5 max_heartrate 297 150. 23.0 71 133 153 166 202 2
6 old_peak 289 1.06 1.17 0 0 0.8 1.6 6.2 5
7 n_vessels 288 0.73 1.02 0 0 0 1 4 5
8 heart_disease 289 0.56 0.5 0 0 1 1 1 5
$`Categorical Variables`
# A tibble: 22 x 4
Variable Group Frequency `%`
<chr> <fct> <int> <dbl>
1 sex male 194 64
2 sex female 91 30
3 sex <NA> 18 6
4 chest_pain_type typical angina 140 46
5 chest_pain_type non-anginal pain 83 27
6 chest_pain_type atypical angina 47 16
7 chest_pain_type asymptomatic 21 7.
8 chest_pain_type <NA> 12 4
9 fasting_blood_sugar lt_120 247 82
10 fasting_blood_sugar gt_120 40 13
# … with 12 more rows
I also have options, such as the following.
hd %>%
select(age, sex, heart_disease) %>%
describe_all(
digits = 2, # round
include_NAcat = FALSE, # don't include NA as a category
include_numeric = TRUE, # allows numeric variables with few levels given in max_levels argument as categorical
max_levels = 3, # include numeric data with only a few values in the categorical result
sort_by_freq = TRUE, # sort categorical data by frequency
extra = TRUE # additional numeric output, e.g. N distinct values
)
$`Numeric Variables`
# A tibble: 2 x 12
Variable N Mean SD Min Q1 Median Q3 Max `% Missing` Distinct Zeros
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age 283 54.7 9 29 48 56 61 77 7 41 0
2 heart_disease 289 0.56 0.5 0 0 1 1 1 5 2 128
$`Categorical Variables`
# A tibble: 4 x 4
Variable Group Frequency `%`
<chr> <fct> <int> <dbl>
1 sex male 194 68
2 sex female 91 32
3 heart_disease 1 161 56.
4 heart_disease 0 128 44
For grouped output, I can use the underlying functions. Here we look at summaries for a couple numerical variables by sex.
hd %>%
num_by(
main_var = vars(age, cholesterol),
group_var = sex,
extra = TRUE
)
# A tibble: 6 x 13
# Groups: sex [3]
sex Variable N Mean SD Min Q1 Median Q3 Max `% Missing` Distinct Zeros
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 male age 179 54.4 8.8 29 48 55 60 77 8 38 0
2 male cholesterol 183 242. 42.1 131 212 236 270 353 6 109 0
3 female age 88 56 9.2 34 50 57 63 76 3 34 0
4 female cholesterol 90 263 65.9 141 219. 255 301 564 1 76 0
5 <NA> age 16 51.2 9.1 35 45.2 52.5 56.2 71 11 16 0
6 <NA> cholesterol 18 215. 43.5 126 198. 214. 242. 302 0 18 0
The categorical variable functionality is very similar.
hd %>%
cat_by(
main_var = chest_pain_type,
group_var = sex
)
# A tibble: 14 x 5
# Groups: sex [3]
sex chest_pain_type N `% of Total` `% of sex`
<chr> <chr> <int> <dbl> <dbl>
1 female asymptomatic 3 0.990 3.30
2 female atypical angina 15 4.95 16.5
3 female non-anginal pain 34 11.2 37.4
4 female typical angina 36 11.9 39.6
5 female <NA> 3 0.990 3.30
6 male asymptomatic 17 5.61 8.76
7 male atypical angina 28 9.24 14.4
8 male non-anginal pain 44 14.5 22.7
9 male typical angina 96 31.7 49.5
10 male <NA> 9 2.97 4.64
11 <NA> asymptomatic 1 0.330 5.56
12 <NA> atypical angina 4 1.32 22.2
13 <NA> non-anginal pain 5 1.65 27.8
14 <NA> typical angina 8 2.64 44.4
As these are tidy tibbles, they are essentially ready for presentation.
describe_all_num(hd) %>%
kableExtra::kable()
Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max | % Missing |
---|---|---|---|---|---|---|---|---|---|
age | 283 | 54.68 | 9.00 | 29 | 48 | 56.0 | 61.0 | 77.0 | 7 |
resting_bp | 286 | 131.60 | 17.47 | 94 | 120 | 130.0 | 140.0 | 200.0 | 6 |
cholesterol | 291 | 246.57 | 52.13 | 126 | 211 | 241.0 | 274.0 | 564.0 | 4 |
resting_ecg | 287 | 0.53 | 0.53 | 0 | 0 | 1.0 | 1.0 | 2.0 | 5 |
max_heartrate | 297 | 149.62 | 23.05 | 71 | 133 | 153.0 | 166.0 | 202.0 | 2 |
old_peak | 289 | 1.06 | 1.17 | 0 | 0 | 0.8 | 1.6 | 6.2 | 5 |
n_vessels | 288 | 0.73 | 1.02 | 0 | 0 | 0.0 | 1.0 | 4.0 | 5 |
heart_disease | 289 | 0.56 | 0.50 | 0 | 0 | 1.0 | 1.0 | 1.0 | 5 |
These functions serve most of my needs for initial peeking at the data. They return a tibble/data.frame class object that makes for easy presentation and visualization. The underlying code uses the widely used and tested tidyverse packages, and mostly adheres to standard programming conventions. This is the same sort of thing I’m looking for.
We begin alphabetically with the arsenal package. Here we use tableby for a generic summary as well as a grouped summary. The result is markdown, so for presentation in an R Markdown document, one must use the chunk option results='asis'.
library(arsenal)
hd_sample %>%
tableby( ~ ., data = .) %>%
summary()
Overall (N=303) | |
---|---|
age | |
N-Miss | 20 |
Mean (SD) | 54.678 (8.997) |
Range | 29.000 - 77.000 |
cholesterol | |
N-Miss | 12 |
Mean (SD) | 246.574 (52.132) |
Range | 126.000 - 564.000 |
resting_bp | |
N-Miss | 17 |
Mean (SD) | 131.605 (17.472) |
Range | 94.000 - 200.000 |
sex | |
N-Miss | 18 |
female | 91 (31.9%) |
male | 194 (68.1%) |
chest_pain_type | |
N-Miss | 12 |
asymptomatic | 21 (7.2%) |
atypical angina | 47 (16.2%) |
non-anginal pain | 83 (28.5%) |
typical angina | 140 (48.1%) |
heart_disease | |
N-Miss | 14 |
Mean (SD) | 0.557 (0.498) |
Range | 0.000 - 1.000 |
hd_sample %>%
tableby(sex ~ ., data = .) %>%
summary(digits = 1)
female (N=91) | male (N=194) | Total (N=285) | p value | |
---|---|---|---|---|
age | 0.173 | |||
N-Miss | 3 | 15 | 18 | |
Mean (SD) | 56.0 (9.2) | 54.4 (8.8) | 54.9 (9.0) | |
Range | 34.0 - 76.0 | 29.0 - 77.0 | 29.0 - 77.0 | |
cholesterol | 0.001 | |||
N-Miss | 1 | 11 | 12 | |
Mean (SD) | 263.0 (65.9) | 241.6 (42.1) | 248.6 (52.1) | |
Range | 141.0 - 564.0 | 131.0 - 353.0 | 131.0 - 564.0 | |
resting_bp | 0.344 | |||
N-Miss | 5 | 10 | 15 | |
Mean (SD) | 133.3 (19.5) | 131.1 (16.6) | 131.8 (17.6) | |
Range | 100.0 - 200.0 | 94.0 - 192.0 | 94.0 - 200.0 | |
chest_pain_type | 0.030 | |||
N-Miss | 3 | 9 | 12 | |
asymptomatic | 3 (3.4%) | 17 (9.2%) | 20 (7.3%) | |
atypical angina | 15 (17.0%) | 28 (15.1%) | 43 (15.8%) | |
non-anginal pain | 34 (38.6%) | 44 (23.8%) | 78 (28.6%) | |
typical angina | 36 (40.9%) | 96 (51.9%) | 132 (48.4%) | |
heart_disease | < 0.001 | |||
N-Miss | 5 | 9 | 14 | |
Mean (SD) | 0.8 (0.4) | 0.5 (0.5) | 0.5 (0.5) | |
Range | 0.0 - 1.0 | 0.0 - 1.0 | 0.0 - 1.0 |
The output is nice and clean in general, though the default is to run statistical tests, so additional arguments are required to turn that off, as sketched below.
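Something like the following, using the control argument with tableby.control, should do it.

hd_sample %>%
  tableby(
    sex ~ .,
    data = .,
    control = tableby.control(test = FALSE, total = FALSE)  # no tests, no total column
  ) %>%
  summary()

Here is an example of categorical-only output using freqlist.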
with(hd, table(sex, chest_pain_type)) %>%
freqlist() %>%
summary(digits.pct = 1)
sex | chest_pain_type | Freq | Cumulative Freq | Percent | Cumulative Percent |
---|---|---|---|---|---|
female | asymptomatic | 3 | 3 | 1.1 | 1.1 |
atypical angina | 15 | 18 | 5.5 | 6.6 | |
non-anginal pain | 34 | 52 | 12.5 | 19.0 | |
typical angina | 36 | 88 | 13.2 | 32.2 | |
male | asymptomatic | 17 | 105 | 6.2 | 38.5 |
atypical angina | 28 | 133 | 10.3 | 48.7 | |
non-anginal pain | 44 | 177 | 16.1 | 64.8 | |
typical angina | 96 | 273 | 35.2 | 100.0 |
Other options include a function for doing pairwise comparisons for a repeated measure3, comparing datasets, and doing a bunch of bivariate regressions.
The tableby summary is essentially a 'Table 1', the unfortunately named display of descriptive stats for the sample of a given study4. Naming aside, my clients often want exactly that, and I've actually used a package specifically geared toward providing it just to save me the headache (tableOne), so I definitely find that aspect useful. However, Table 1s almost invariably contain needless statistical output or analysis, and are not a model of reporting I would choose. And while the layout is not going to be viable for more than a few variables, it is common practice to present them that way in journal articles regardless of verbosity/legibility.
I'm not thrilled with markdown output as the default, as I have little control over it, but it's fine. Along with that, the default layout is not very succinct, and it appears to emulate SAS output, which is not a model for presentation in my opinion. One can use a function to write the output of a single table to html, but I'd rather it just be amenable to the document I'm already creating, or create a report of all tables for me. This can be done using as.data.frame, but at that point I haven't really gained anything over my own functions. Lastly, I'm not crazy about using the summary function to get the output. The underlying list objects are not 'print ready', so making the summary (or as.data.frame) behavior an argument defaulting to TRUE would make more sense to me design-wise.
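For reference, a minimal sketch of the data frame route just mentioned; the text = TRUE argument, which strips the markdown styling, is my reading of the arsenal docs.

tb = tableby(~ ., data = hd_sample)

# a plain data.frame you can then style however you like
tb_df = as.data.frame(summary(tb, text = TRUE))

head(tb_df)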
Now we move to DataExplorer. Let’s introduce the package with the introduce function. I already like this, as it provides good info that extends what you’d get with the base R function str or dplyr’s glimpse.
library(DataExplorer)
introduce(hd)
# A tibble: 1 x 9
rows columns discrete_columns continuous_colu… all_missing_col… total_missing_v… complete_rows total_observati…
<int> <int> <int> <int> <int> <int> <int> <int>
1 303 14 6 8 0 213 144 4242
# … with 1 more variable: memory_usage <dbl>
It can also display this information visually in two different ways. I like this, but can't say I'd ever have a reason to actually use it.
plot_intro(hd)
plot_str(list(hd = hd, gapminder = gapminder_2019)) # this doesn't actually display
We can focus on the missingness, which is nice, but I don't find much use for simple bar charts like this, as they actually make what is only a single row of information harder to parse, without adding any visual interest.
plot_missing(hd_sample)
DataExplorer can also plot distributions, e.g. bar and histograms, for the actual values of the categorical/discrete variables.
plot_bar(hd)
But to demonstrate what I was saying earlier about visualizations, this sort of thing only takes a couple lines of ggplot to do on your own, which you can then customize more easily. At least DataExplorer saves you the trouble of ordering the bars by default, which has always been a pain with ggplot2 (though see the forcats approach after the next block).
hd %>%
select_if(is.character) %>%
pivot_longer(everything(), names_to = 'variable', values_to = 'value') %>%
ggplot(aes(x = value)) +
geom_bar() +
coord_flip() +
facet_wrap(~ variable, scales = 'free')
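And if you want the frequency ordering DataExplorer provides, forcats (loaded with the tidyverse) gets you there with one more function call.

hd %>%
  select_if(is.character) %>%
  pivot_longer(everything(), names_to = 'variable', values_to = 'value') %>%
  ggplot(aes(x = forcats::fct_infreq(value))) +  # order bars by frequency
  geom_bar() +
  coord_flip() +
  facet_wrap(~ variable, scales = 'free')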
Similarly there are QQ plots, scatterplots and more. I always try to get people to try correlation plots in lieu of large correlation matrices, and DataExplorer provides this. The nice thing is that it will automatically create indicator variables for levels of categorical variables, but beyond that there are issues. For one, its diagonal is reversed from the typical presentation of correlation matrices5. If you don't drop the missing values via the additional argument, the plot isn't as useful, because it will include NA as a factor level. In addition, 'centered' is a rather odd choice for alignment of the x axis in my opinion, so this would take additional work to make presentable. But to its credit, DataExplorer has an option to focus on only continuous or only discrete output.
plot_correlation(
hd,
ggtheme = theme_minimal(),
cor_args = list("use" = "pairwise.complete.obs")
)
There are packages that do this sort of plot specifically, such as heatmaply. Before these came around I already had my own function, which also does an internal factor analysis to sort the variables and produces an interactive result. So while this aspect of DataExplorer might be useful to others, it doesn't appeal much to me, especially given the other issues, and one of the other packages we'll see does it a bit better in my opinion.
hd %>%
select_if(is.numeric) %>%
cor(use = 'pair') %>%
visibly::corr_heat()
If all, or only some, of the output options appeal to you, they can be nicely wrapped up in an automatic report. Various settings for each type of output can be set with an additional function (configure_report) or by just passing a list of arguments, including which functions to use, the ggplot2 theme, etc. You can download the report here,6 but I show a screenshot.
create_report(
hd,
y = 'heart_disease',
output_dir = 'other_docs',
output_file = 'data_explorer_report.html',
report_title = 'My Data Description'
)
I think many would like at least some of the functionality in DataExplorer, as well as many of the visualizations. The ease of generating a report should also be sufficient for anyone's personal use, and with some tweaking, for presentation to others. It also uses data.table under the hood, so it can likely handle large data with more efficiency than the other packages. Not covered here, but DataExplorer also has functionality for feature processing and engineering, for example, collapsing sparse categories, dummy coding, etc.
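A quick taste of that side of the package; group_category and dummify are actual DataExplorer functions, though the threshold here is just illustrative.

# one-hot encode the discrete columns, then summarize the result
hd %>%
  dummify() %>%
  introduce()

# collapse the rarest levels of a feature into an 'OTHER' category
hd %>%
  group_category(feature = 'chest_pain_type', threshold = 0.2, update = TRUE) %>%
  count(chest_pain_type)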
The issues I have with this one are pretty minor, aside from the unnecessary statistical analysis and some visualization choices. I would also recommend pryr::object_size rather than base R's function.
The gtsummary package wasn't available when the original article came out, but despite being new, it's seen as a companion to gt, which is put out by RStudio and has been around for some time. It is also attuned to the tidy tibbles created by the usual tidyverse approach.
library(gtsummary)
hd %>%
tbl_summary()
Characteristic | N = 303¹ |
---|---|
age | 56 (48, 61) |
Unknown | 20 |
sex | |
female | 91 (32%) |
male | 194 (68%) |
Unknown | 18 |
chest_pain_type | |
asymptomatic | 21 (7.2%) |
atypical angina | 47 (16%) |
non-anginal pain | 83 (29%) |
typical angina | 140 (48%) |
Unknown | 12 |
resting_bp | 130 (120, 140) |
Unknown | 17 |
cholesterol | 241 (211, 274) |
Unknown | 12 |
fasting_blood_sugar | |
gt_120 | 40 (14%) |
lt_120 | 247 (86%) |
Unknown | 16 |
resting_ecg | |
0 | 139 (48%) |
1 | 144 (50%) |
2 | 4 (1.4%) |
Unknown | 16 |
max_heartrate | 153 (133, 166) |
Unknown | 6 |
exer_angina | 98 (34%) |
Unknown | 18 |
old_peak | 0.80 (0.00, 1.60) |
Unknown | 14 |
slope | |
flat | 135 (46%) |
negative | 136 (47%) |
positive | 21 (7.2%) |
Unknown | 11 |
n_vessels | |
0 | 165 (57%) |
1 | 63 (22%) |
2 | 37 (13%) |
3 | 18 (6.2%) |
4 | 5 (1.7%) |
Unknown | 15 |
defect | |
fixed_defect | 155 (56%) |
normal | 17 (6.1%) |
reversible_defect | 107 (38%) |
Unknown | 24 |
heart_disease | 161 (56%) |
Unknown | 14 |
¹ Statistics presented: median (IQR); n (%)
This is similar to what we’ve seen, but I will note that the footnote was automatically generated, saving me the trouble, and it’s nice to have a single function essentially give me what I want. It will also do the ‘by’ approach, and I show a couple options for tailoring the output.
hd_sample %>%
tbl_summary(
by = sex,
statistic = list(all_continuous() ~ "{mean} ({p10}, {p90})"),
    digits = list(age ~ c(2, 1)), # digits for each statistic in order (recycled)
label = list(resting_bp ~ 'Resting Blood Press.')
)
Characteristic | female, N = 91¹ | male, N = 194¹ |
---|---|---|
age | 55.95 (42.7, 66.30) | 54.36 (43.0, 66.00) |
Unknown | 3 | 15 |
cholesterol | 263 (197, 340) | 242 (188, 300) |
Unknown | 1 | 11 |
Resting Blood Press. | 133 (109, 158) | 131 (110, 152) |
Unknown | 5 | 10 |
chest_pain_type | ||
asymptomatic | 3 (3.4%) | 17 (9.2%) |
atypical angina | 15 (17%) | 28 (15%) |
non-anginal pain | 34 (39%) | 44 (24%) |
typical angina | 36 (41%) | 96 (52%) |
Unknown | 3 | 9 |
heart_disease | 65 (76%) | 84 (45%) |
Unknown | 5 | 9 |
¹ Statistics presented: mean (10%, 90%); n (%)
If we assign the table, we can then use its values inline, which is extremely useful for write-up.
tab = hd_sample %>%
tbl_summary()
inline_text(tab, variable = age, column = "stat_0")
[1] "56 (48, 61)"
We would then put something like `r inline_text(tab, variable = age, column = "stat_0")` in the actual text we are writing to actually print the result.
Here is an example of some customization options where I include a custom function, rearrange some things, add bold, etc.
geomean <- function(x, na.rm = TRUE) {
  if (is.null(nrow(x))) {
    # vector input: geometric mean
    exp(mean(log(x), na.rm = na.rm))
  } else {
    # matrix/data frame input: columnwise geometric means
    exp(apply(log(x), 2, mean, na.rm = na.rm))
  }
}
set_gtsummary_theme(theme_gtsummary_compact())
hd_sample %>%
tbl_summary(
by = sex,
missing = "no",
statistic = all_continuous() ~ "{geomean} ({p75}, {p25}) [N = {N_nonmiss}]"
) %>%
add_n() %>%
bold_labels() %>%
italicize_levels() %>%
modify_spanning_header(starts_with("stat_") ~ "**Sex**")
Characteristic | N | Sex | |
---|---|---|---|
| | female, N = 91¹ | male, N = 194¹ |
age | 267 | 55 (63, 50) [N = 88] | 54 (60, 48) [N = 179] |
cholesterol | 273 | 256 (301, 219) [N = 90] | 238 (270, 212) [N = 183] |
resting_bp | 270 | 132 (140, 120) [N = 86] | 130 (140, 120) [N = 184] |
chest_pain_type | 273 | ||
asymptomatic | 3 (3.4%) | 17 (9.2%) | |
atypical angina | 15 (17%) | 28 (15%) | |
non-anginal pain | 34 (39%) | 44 (24%) | |
typical angina | 36 (41%) | 96 (52%) | |
heart_disease | 271 | 65 (76%) | 84 (45%) |
¹ Statistics presented: geomean (75%, 25%) [N = N]; n (%)
The gtsummary package strikes me as a very powerful tool for customized tables that work seamlessly within a tidyverse workflow. It was easy to immediately start using. Not shown, but it also has good support for standard models (e.g. lm, glm, survival, and more). It also appears that there is a lot of customization possible.
There is little not to like here, especially for a package just starting out. The only quibble I have is that the notation for some of the customization is not intuitive, but that's probably something one gets used to.
I wanted to look at the SmartEDA package because the figure in the article was of a clean report. Let's start our exploration with the basic ExpData function. I like functions like this, which give a useful summary of the data set as a whole.
library(SmartEDA)
ExpData(hd)
Descriptions Obs
1 Sample size (Nrow) 303
2 No. of Variables (Ncol) 14
3 No. of Numeric Variables 8
4 No. of Factor Variables 0
5 No. of Text Variables 6
6 No. of Logical Variables 0
7 No. of Unique Variables 0
8 No. of Date Variables 0
9 No. of Zero variance Variables (Uniform) 0
10 %. of Variables having complete cases 0% (0)
11 %. of Variables having <50% missing cases 100% (14)
12 %. of Variables having >50% missing cases 0% (0)
13 %. of Variables having >90% missing cases 0% (0)
We can look at ExpNumStat to get some basic stats for numeric variables.
ExpNumStat(hd, round = 1)
Vname Group TN nNeg nZero nPos NegInf PosInf NA_Value Per_of_Missing sum min max mean median SD CV
1 age All 303 0 0 283 0 0 20 6.6 15474.0 29 77.0 54.7 56.0 9.0 0.2
3 cholesterol All 303 0 0 291 0 0 12 4.0 71753.0 126 564.0 246.6 241.0 52.1 0.2
4 max_heartrate All 303 0 0 297 0 0 6 2.0 44438.0 71 202.0 149.6 153.0 23.1 0.2
5 old_peak All 303 0 92 197 0 0 14 4.6 306.1 0 6.2 1.1 0.8 1.2 1.1
2 resting_bp All 303 0 0 286 0 0 17 5.6 37639.0 94 200.0 131.6 130.0 17.5 0.1
IQR Skewness Kurtosis
1 13.0 -0.2 -0.5
3 63.0 1.2 4.5
4 33.0 -0.5 -0.1
5 1.6 1.2 1.5
2 20.0 0.8 0.9
ExpNumStat also provides grouped output, with various options to set.
ExpNumStat(
hd_sample,
by = "GA",
gp = "sex",
Qnt = c(.1, .9),
Outlier = TRUE,
round = 1
)
Vname Group TN nNeg nZero nPos NegInf PosInf NA_Value Per_of_Missing sum min max mean median SD
1 age sex:All 303 0 0 283 0 0 20 6.6 15474 29 77 54.7 56 9.0
4 age sex:male 194 0 0 179 0 0 15 7.7 9731 29 77 54.4 55 8.8
7 age sex:female 91 0 0 88 0 0 3 3.3 4924 34 76 56.0 57 9.2
10 age sex:NA 0 0 0 0 0 0 0 NaN 0 Inf -Inf NaN NA NA
2 cholesterol sex:All 303 0 0 291 0 0 12 4.0 71753 126 564 246.6 241 52.1
5 cholesterol sex:male 194 0 0 183 0 0 11 5.7 44204 131 353 241.6 236 42.1
8 cholesterol sex:female 91 0 0 90 0 0 1 1.1 23673 141 564 263.0 255 65.9
11 cholesterol sex:NA 0 0 0 0 0 0 0 NaN 0 Inf -Inf NaN NA NA
3 resting_bp sex:All 303 0 0 286 0 0 17 5.6 37639 94 200 131.6 130 17.5
6 resting_bp sex:male 194 0 0 184 0 0 10 5.2 24131 94 192 131.1 130 16.6
9 resting_bp sex:female 91 0 0 86 0 0 5 5.5 11466 100 200 133.3 130 19.5
12 resting_bp sex:NA 0 0 0 0 0 0 0 NaN 0 Inf -Inf NaN NA NA
CV IQR Skewness Kurtosis 10% 90% LB.25% UB.75% nOutliers
1 0.2 13.0 -0.2 -0.5 42.2 66.0 28.5 80.5 0
4 0.2 12.0 -0.2 -0.4 43.0 66.0 30.0 78.0 1
7 0.2 13.0 -0.3 -0.6 42.7 66.3 30.5 82.5 0
10 NA NA NaN NaN NA NA NA NA 0
2 0.2 63.0 1.2 4.5 188.0 309.0 116.5 368.5 5
5 0.2 58.0 0.2 -0.4 188.0 299.8 125.0 357.0 0
8 0.3 81.8 1.3 3.8 197.0 340.1 96.6 423.6 1
11 NA NA NaN NaN NA NA NA NA 0
3 0.1 20.0 0.8 0.9 110.0 152.0 90.0 170.0 8
6 0.1 20.0 0.6 0.6 110.0 152.0 90.0 170.0 3
9 0.1 20.0 0.8 1.0 109.0 157.5 90.0 170.0 5
12 NA NA NaN NaN NA NA NA NA 0
There is also some visualization of relationships with a given target variable.
ExpNumViz(hd_sample, target = 'cholesterol')
[[1]]
[[2]]
Here are the default categorical data summaries. This is a nice and clean data.frame presentation.
ExpCTable(hd_sample)
Variable Valid Frequency Percent CumPercent
1 sex female 91 30.03 30.03
2 sex male 194 64.03 94.06
3 sex NA 18 5.94 100.00
4 sex TOTAL 303 NA NA
5 chest_pain_type asymptomatic 21 6.93 6.93
6 chest_pain_type atypical angina 47 15.51 22.44
7 chest_pain_type NA 12 3.96 26.40
8 chest_pain_type non-anginal pain 83 27.39 53.79
9 chest_pain_type typical angina 140 46.20 99.99
10 chest_pain_type TOTAL 303 NA NA
11 heart_disease 0 128 42.24 42.24
12 heart_disease 1 161 53.14 95.38
13 heart_disease NA 14 4.62 100.00
14 heart_disease TOTAL 303 NA NA
As with the numeric variables, there is also visualization for the categorical variables.
ExpCatViz(hd_sample)
[[1]]
[[2]]
[[3]]
SmartEDA also provides reporting functionality. There are some options you can fiddle with, like using your own template, changing the default theme, etc. You can download the report I made7, though a screenshot is provided.
ExpReport(
hd,
theme = visibly::theme_clean(),
op_dir = 'other_docs/',
op_file = 'smarteda.html'
)
SmartEDA is fairly intuitive to use. It returns a data frame, and some of the less verbose output is quite ready to go. It can also generate a report in automatic fashion.
A lot of this isn't very useful to me, such as parallel coordinate plots, 'outlier' analysis, etc. I'm not crazy about the naming conventions, both for the functions and arguments (every single function in the package begins with Exp), though I guess it makes the package easier to find via auto-complete. The default color schemes for some output are simply not viable for presentation, and the report doesn't really have options for controlling the output the way DataExplorer did. I also had issues getting it to work outside of the current working directory for the initial document creation. And changing the sn or sc options (I'm not sure why I would want only a sample of plots rather than specifying which plots I want) didn't appear to actually change the resulting document, but maybe I'm missing something.
The summarytools package provides four main functions to work with, but I’m going to skip those and go straight to the tool that uses those key functions and puts their results into a very nice presentation. In the following I use dfSummary to provide information for both numeric and categorical output, and even some visualization.
library(summarytools)
dfSummary(
hd,
varnumbers = FALSE,
round.digits = 2,
plain.ascii = FALSE,
style = "grid",
graph.magnif = .33,
valid.col = FALSE,
tmp.img.dir = "img"
)
Dimensions: 303 x 14
Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
---|---|---|---|---|
age [numeric] | Mean (sd): 54.7 (9); min < med < max: 29 < 56 < 77; IQR (CV): 13 (0.2) | 41 distinct values | | 20 (6.6%) |
sex [character] | 1. female; 2. male | 91 (31.9%); 194 (68.1%) | | 18 (5.94%) |
chest_pain_type [character] | 1. asymptomatic; 2. atypical angina; 3. non-anginal pain; 4. typical angina | 21 (7.2%); 47 (16.2%); 83 (28.5%); 140 (48.1%) | | 12 (3.96%) |
resting_bp [numeric] | Mean (sd): 131.6 (17.5); min < med < max: 94 < 130 < 200; IQR (CV): 20 (0.1) | 49 distinct values | | 17 (5.61%) |
cholesterol [numeric] | Mean (sd): 246.6 (52.1); min < med < max: 126 < 241 < 564; IQR (CV): 63 (0.2) | 148 distinct values | | 12 (3.96%) |
fasting_blood_sugar [character] | 1. gt_120; 2. lt_120 | 40 (13.9%); 247 (86.1%) | | 16 (5.28%) |
resting_ecg [numeric] | Mean (sd): 0.5 (0.5); min < med < max: 0 < 1 < 2; IQR (CV): 1 (1) | 0: 139 (48.4%); 1: 144 (50.2%); 2: 4 (1.4%) | | 16 (5.28%) |
max_heartrate [numeric] | Mean (sd): 149.6 (23.1); min < med < max: 71 < 153 < 202; IQR (CV): 33 (0.2) | 91 distinct values | | 6 (1.98%) |
exer_angina [character] | 1. no; 2. yes | 187 (65.6%); 98 (34.4%) | | 18 (5.94%) |
old_peak [numeric] | Mean (sd): 1.1 (1.2); min < med < max: 0 < 0.8 < 6.2; IQR (CV): 1.6 (1.1) | 40 distinct values | | 14 (4.62%) |
slope [character] | 1. flat; 2. negative; 3. positive | 135 (46.2%); 136 (46.6%); 21 (7.2%) | | 11 (3.63%) |
n_vessels [numeric] | Mean (sd): 0.7 (1); min < med < max: 0 < 0 < 4; IQR (CV): 1 (1.4) | 0: 165 (57.3%); 1: 63 (21.9%); 2: 37 (12.8%); 3: 18 (6.2%); 4: 5 (1.7%) | | 15 (4.95%) |
defect [character] | 1. fixed_defect; 2. normal; 3. reversible_defect | 155 (55.6%); 17 (6.1%); 107 (38.4%) | | 24 (7.92%) |
heart_disease [numeric] | Min: 0; Mean: 0.6; Max: 1 | 0: 128 (44.3%); 1: 161 (55.7%) | | 14 (4.62%) |
This is exactly what I want: basic information that is neither overwhelming nor redundant; a usable data frame object; simple visuals that enhance the output without adding to the total information you have to parse (and which can be turned off); and basic categorical information. In one function, I have pretty much all I'd need, but with control to tweak as necessary. The following shows how one can use standard group_by piping to get a grouped result.
hd_sample %>%
group_by(sex) %>%
dfSummary(
varnumbers = FALSE,
round.digits = 2,
plain.ascii = FALSE,
na.col = FALSE,
style = "grid",
graph.magnif = .33,
valid.col = FALSE,
useNA = FALSE,
tmp.img.dir = "/tmp"
)
Group: sex = female
Dimensions: 91 x 6
Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph |
---|---|---|---|
age [numeric] | Mean (sd): 56 (9.2); min < med < max: 34 < 57 < 76; IQR (CV): 13 (0.2) | 34 distinct values | |
cholesterol [numeric] | Mean (sd): 263 (65.9); min < med < max: 141 < 255 < 564; IQR (CV): 81.8 (0.3) | 76 distinct values | |
resting_bp [numeric] | Mean (sd): 133.3 (19.5); min < med < max: 100 < 130 < 200; IQR (CV): 20 (0.1) | 32 distinct values | |
sex [character] | 1. female | 91 (100.0%) | |
chest_pain_type [character] | 1. asymptomatic; 2. atypical angina; 3. non-anginal pain; 4. typical angina | 3 (3.4%); 15 (17.1%); 34 (38.6%); 36 (40.9%) | |
heart_disease [numeric] | Min: 0; Mean: 0.8; Max: 1 | 0: 21 (24.4%); 1: 65 (75.6%) | |
Group: sex = male
Dimensions: 194 x 6
Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph |
---|---|---|---|
age [numeric] | Mean (sd): 54.4 (8.8); min < med < max: 29 < 55 < 77; IQR (CV): 12 (0.2) | 38 distinct values | |
cholesterol [numeric] | Mean (sd): 241.6 (42.1); min < med < max: 131 < 236 < 353; IQR (CV): 58 (0.2) | 109 distinct values | |
resting_bp [numeric] | Mean (sd): 131.1 (16.6); min < med < max: 94 < 130 < 192; IQR (CV): 20 (0.1) | 42 distinct values | |
sex [character] | 1. male | 194 (100.0%) | |
chest_pain_type [character] | 1. asymptomatic; 2. atypical angina; 3. non-anginal pain; 4. typical angina | 17 (9.2%); 28 (15.1%); 44 (23.8%); 96 (51.9%) | |
heart_disease [numeric] | Min: 0; Mean: 0.5; Max: 1 | 0: 101 (54.6%); 1: 84 (45.4%) | |
Group: sex = NA
Dimensions: 18 x 6
Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph |
---|---|---|---|
age [numeric] | Mean (sd): 51.2 (9.1); min < med < max: 35 < 52.5 < 71; IQR (CV): 11 (0.2) | 16 distinct values | |
cholesterol [numeric] | Mean (sd): 215.3 (43.5); min < med < max: 126 < 213.5 < 302; IQR (CV): 44 (0.2) | 18 distinct values | |
resting_bp [numeric] | Mean (sd): 127.6 (15); min < med < max: 108 < 122 < 160; IQR (CV): 16 (0.1) | 13 distinct values | |
sex [character] | All NA's | | |
chest_pain_type [character] | 1. asymptomatic; 2. atypical angina; 3. non-anginal pain; 4. typical angina | 1 (5.6%); 4 (22.2%); 5 (27.8%); 8 (44.4%) | |
heart_disease [numeric] | Min: 0; Mean: 0.7; Max: 1 | 0: 6 (33.3%); 1: 12 (66.7%) | |
The summarytools package provides a single function that produces a ready-to-present table within whatever document I’m already using. Very nice!
The only nitpicky stuff I have with this package is that the ctable function isn't useful to me, and some of the numeric description is not really necessary. I had issues trying to use the suggested tmp directory for the image files, but found it easier to just use a project folder. I couldn't find a straightforward way to get the summarytools function to exclude the NA category for sex, but this can be done by piping to drop_na(sex) before the dfSummary call, as shown below. I also don't care for the use of pander, as finding out what it can and can't do for a particular class of object is a bit of a pain with the documentation.
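For example, dropping the NA group first:

hd_sample %>%
  tidyr::drop_na(sex) %>%   # remove rows where sex is missing
  group_by(sex) %>%
  dfSummary(
    varnumbers   = FALSE,
    valid.col    = FALSE,
    graph.magnif = .33
  )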
Now we move to data validity. As a reminder, data validity is more about providing checks on the data rather than summarizing it per se. These are packages more geared toward spotting or dealing with data issues.
The primary utility of dataMaid is the examination of data consistency, but it provides basic summaries as well, and in the end it creates a useful report. I came across it via the author's presentation at the Ann Arbor R User Group, and have actually used it before on a project with success. The result can be found here8, though here is a screenshot.
library(dataMaid)
dataMaid::makeDataReport(
hd,
output = 'html',
file = 'other_docs/dataMaid_report',
replace = TRUE,
maxDecimals = 1
)
As you can see, it provides an overall summary of the data frame, including variable types, unique values, missing percentage, and potential problems. The problems mostly regard outliers, which are fine to inspect, though the cutoffs are arbitrarily set.
dataMaid is highly customizable, but the default is already a great way to get a good sense of whether your data is doing what it’s supposed to do.
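If you just want to spot-check a variable without generating the whole report, the underlying checks are exposed directly; check and identifyOutliers are actual dataMaid functions.

# run all class-appropriate checks on a single variable
check(hd$cholesterol)

# or just the outlier check used for flagging in the report
identifyOutliers(hd$cholesterol)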
I initially couldn't get anything but a pdf document even though I clearly specified html as per the documentation, so I'm not sure what's going on there, but I eventually sorted it out. The default outlier check, while better than most would do, seemed too sensitive, but this is nitpicky. In the past I've found it to be slow in rendering the output for a larger data set, but this isn't something you would usually have to run more than once after getting your options squared away.
Sadly, I still regularly receive data from Excel (and SPSS), which means I have to worry about things that I shouldn’t have to worry about, like whether dates will actually be treated appropriately, if the column names are usable, or whether the imported file contains 10000 empty rows because someone accidentally hit the space bar.
The janitor package provides a few simple tools that many will probably find useful at some point, and I regularly use its remove_empty function after importing anything from Excel. As an example I’ll add an empty column with an Excel-like name, a duplicated row, and a useless column with a constant value (all very typical in Excel files I receive).
library(janitor)
hd2 = hd
hd2$`test this - ridiculous (name)` = NA # a terribly formatted name that has no values
hd2$`why.is-this_here` = 'same value everywhere' # constant column
hd2[nrow(hd2) + 1,] = hd2[1,] # add a duplicated row
hd2 %>%
clean_names() %>%
colnames()
[1] "age" "sex" "chest_pain_type" "resting_bp"
[5] "cholesterol" "fasting_blood_sugar" "resting_ecg" "max_heartrate"
[9] "exer_angina" "old_peak" "slope" "n_vessels"
[13] "defect" "heart_disease" "test_this_ridiculous_name" "why_is_this_here"
And here is the remove_* functionality.
hd3 = hd2 %>%
remove_empty() %>%
remove_constant()
# empty and constant columns removed; the duplicated row remains (get_dupes would find it)
colnames(hd3)
[1] "age" "sex" "chest_pain_type" "resting_bp" "cholesterol"
[6] "fasting_blood_sugar" "resting_ecg" "max_heartrate" "exer_angina" "old_peak"
[11] "slope" "n_vessels" "defect" "heart_disease"
c(nrow(hd2), nrow(hd3))
[1] 304 304
Just a couple little things like that are very useful. Otherwise, there is functionality for dates, tables, etc. that some might find useful in a pinch, though one would probably get more from the summary functions in the packages we've demonstrated, and more advanced users may simply use lubridate, stringr, and similar packages directly.
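A taste of that other functionality, using a couple of janitor's actual functions:

# a quick cross tabulation with percentage adornment
hd %>%
  tabyl(sex, heart_disease) %>%
  adorn_percentages() %>%    # row percentages by default
  adorn_pct_formatting()

# convert an Excel date serial number to an actual date
excel_numeric_to_date(42370)  # 2016-01-01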
janitor provides useful functionality not seen in the other packages.
Not a whole lot going on here relative to some of the other packages, but what is there is likely to be useful to many.
The package visdat9 is, as its name implies, purely for visualization, and this includes missingness, correlation, and more. We can start by visualizing the missing data.
library(visdat)
vis_dat(hd_sample)
We can look at variable types along with the missingness. I like the integer/double mix, which might point to an issue if, for example, all the data should be integer valued.
vis_guess(hd)
We can visualize the correlations as we did with DataExplorer, and the default is pairwise complete observations. This is one of the cleaner correlation plots I've come across, but there is no way to order it meaningfully (though see the workaround after the plot).
hd %>%
select_if(is.numeric) %>%
vis_cor()
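If the lack of ordering bothers you, one workaround is to reorder the columns yourself before handing the data over, e.g. using the leaf order from clustering the correlation matrix (a sketch):

num_dat = select_if(hd, is.numeric)

cor_mat = cor(num_dat, use = 'pair')

# cluster on correlation distance, then reorder columns by the leaf order
ord = hclust(as.dist(1 - cor_mat))$order

vis_cor(num_dat[, ord])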
One nice feature I haven't seen elsewhere is visualizing whether a given expectation holds across a data set.
hd %>%
select(sex) %>%
vis_expect(expectation = ~ .x == 'male')
We can also compare whole data sets. Here the differences arise from missing values in one and not the other10.
hd_sample_2 = noiris::heart_disease %>%
select(colnames(hd_sample)) %>%
mutate_if(is.factor, as.character)
vis_compare(
hd_sample,
hd_sample_2
)
This package adds some useful functionality we haven’t seen. In addition, the underlying code adheres to open science standards. There is a complementary package naniar for more ways to deal with missing data.
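For example, a couple of naniar functions that pair well with what we've just seen:

library(naniar)

miss_var_summary(hd)  # missing count and percent for each variable

gg_miss_upset(hd)     # which combinations of variables go missing together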
This package isn't for numeric description, so it can only supplement typical EDA as we've seen before. You may have to do some pre-processing for some functions, unlike with other tools (e.g. subsetting to numeric variables). But these are minor issues at best given the package's purpose.
As mentioned previously, I'm not interested in using these packages for data preparation, and based on Staniak and Biecek, only a couple of them provide it, DataExplorer being the only one we've explored here. Mean/median/mode imputation is a great way to attenuate correlation, so I'm not interested in doing that. I don't even know what an 'outlier' is outside of a modeling framework, so I'm definitely not interested in doing something about extreme values based on a univariate analysis11. Discretizing numeric variables should almost never be done12, so that functionality is not desirable. And between base R functions like scale and log, the rescale function from scales, and factor lumping via forcats, I don't really need another package for this sort of thing.
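To make that concrete, here is roughly what I mean; the variable choices are just illustrative.

hd_trans = hd %>%
  mutate(
    age_std     = scale(age)[, 1],            # standardize
    chol_log    = log(cholesterol),           # log transform
    age_01      = scales::rescale(age),       # rescale to [0, 1]
    defect_lump = forcats::fct_lump(factor(defect), n = 2)  # lump rare categories
  )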
Staniak and Biecek (2019) provide a nice summary of many packages that can be used for automated exploratory data analysis. I wanted to explore and demonstrate some of these in more detail, along with a couple packages that weren't included. My impression is that many of these packages have notably useful functionality, though a lot of it may be overlapping, or superfluous from a more serious analytical standpoint. In addition, I personally wouldn't use the word automated to describe most of the functionality in most of these packages, as multiple specific functions, possibly with different options for each, are needed to ultimately generate a data-driven product. That's not necessarily a bad thing, as you should know what's in the data if you want to make decisions based on it; just be clear that 'automated' doesn't mean you won't have your work cut out for you.
Here is a rough list of features I made based on quick inspections.
Package | Ready to use output | Code Quality | Visualization | Report Generation | Dedicated Website |
---|---|---|---|---|---|
arsenal | 👍 | ||||
DataExplorer | 👍 | 👍 | 👍 | 👍 | 👍 |
gtsummary | 👍 | 👍 | 👍 | ||
SmartEDA | 👍 | 👍 | 👍 | 👍 | |
summarytools | 👍 | 👍 | |||
visdat | 👍 | 👍 | 👍 | 👍 | |
dataMaid | 👍 | 👍 | |||
janitor | 👍 | 👍 | 👍 |
In general, for my needs I would imagine I’d use summarytools or DataExplorer for presentation to clients or team members, and gtsummary as the workhorse, especially for publication. In addition, I would likely use visdat and janitor to fill in additional information or help with the data processing. Interestingly, some of these also appear to be the most popular packages under consideration13, so it’s good to know they are more widely used, though that was a criterion for selection as well. Any of them might be useful for your specific needs, particularly if you’re willing to invest in some of the more customizable ones, so take a spin with any you like.
Find a data set you like and do the following. I suggest using a manageable one or some subset so that you don’t get an overwhelming amount of output.
Staniak, Mateusz, and Przemyslaw Biecek. 2019. “The Landscape of R Packages for Automated Exploratory Data Analysis.” The R Journal. https://journal.r-project.org/archive/2019/RJ-2019-033/index.html.
Wirth, Rüdiger, and Jochen Hipp. 2000. "CRISP-DM: Towards a Standard Process Model for Data Mining." In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, 29–39. Springer-Verlag London, UK.
However, for our demonstration I'll be using something a little more wieldy.
To be honest, I wasn't familiar with the CRoss Industry Standard Process for Data Mining until reading the article citing it. I don't get the impression any particular methodology is consciously thought about by the vast majority of practicing data scientists, but it's useful in providing a framework for the content here.
The data we're using isn't paired, and their documentation doesn't provide a working example.
I see 'Table 1' used mostly by folks in medical fields, who subsequently place it as table 2, 3, or whatever, as would normally be the case.
I suspect this may have more to do with ggplot than what is desired. I also found that the orientation/alignment of names differed between the report and the inline presentation.
You need to download the file and explicitly open it in your browser.
You need to download the file and explicitly open it in your browser.
You need to download the file and explicitly open it in your browser.
The article by Staniak & Biecek places visdat with description rather than validity, but the data frame comparison, expectation investigation, and missingness exploration seem more like validity issues than data summaries to me.
The arsenal package also provides this functionality. It's quite verbose, so I didn't show it, but it can be recommended if you need a more in-depth comparison.
Outliers are an indicator of model weakness/limitation/failure, not a data problem. If there is an actual data problem, it should be corrected or removed.
The only times I do this are with an already discrete variable, for example, going from 7 values to 5, or coarsening that might be applied in very large data situations for computational efficiency. Issues with discretizing are noted here, and while there are better ways to do it if you feel you must, you probably won't have seen them done in practice, and the chosen method likely wouldn't hold across repeated data.
The janitor package, which was not explored in the original article, is actually notably more popular than any of the other packages.