11  Danger Zone

The following are very brief thoughts on common pitfalls in modeling that occurred to us while writing this book. The list is not meant to be comprehensive, but rather a quick reference for things you’ll probably want to avoid. We’re not going to tell you never to do these things, but we are going to tell you to be very careful if you do, and to be ready to defend your choices.

11.1 Things to Generally Avoid

LM:

  • use stepwise/best subset regression
  • focus on statistical significance while ignoring predictive or practical results
  • forget to rely heavily on domain knowledge
  • forget to rely heavily on visualizations
  • use statistical tests in lieu of visualizations or practical metrics to check assumptions
  • compare models with R² (see the sketch after this list)
  • drop/modify extreme values
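
On the R² point: in-sample R² can only increase as you add features, so it happily rewards noise. Here’s a minimal sketch with simulated data and scikit-learn (the sample size and feature counts are arbitrary) showing how in-sample R² and cross-validated error can tell different stories:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500)

# Tack on 20 pure-noise features
X_noisy = np.hstack([X, rng.normal(size=(500, 20))])

for data, label in [(X, "original"), (X_noisy, "plus noise features")]:
    # In-sample R² never goes down when features are added...
    r2 = LinearRegression().fit(data, y).score(data, y)
    # ...but held-out error shows the noise features don't help
    rmse = -cross_val_score(
        LinearRegression(), data, y, cv=5,
        scoring="neg_root_mean_squared_error"
    ).mean()
    print(f"{label}: in-sample R2 = {r2:.3f}, CV RMSE = {rmse:.3f}")
```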

GLM:

  • use pseudo-R² for any important conclusion; these aren’t very useful and will never really be equivalent to the R² you get from OLS
  • treat accuracy as the primary metric for your logistic regression
  • assume GLMs are enough for your problem; for example, Poisson’s assumption that the mean equals the variance is very restrictive and typically doesn’t hold (see the sketch after this list)
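
As a quick check on the Poisson point, the ratio of the Pearson χ² statistic to the residual degrees of freedom should be near 1 when the mean = variance assumption holds. A sketch using statsmodels on simulated overdispersed counts (the coefficients and settings are arbitrary):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.3 * x)
# Negative binomial counts: variance exceeds the mean (overdispersion)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Dispersion estimate: should be ~1 for a well-specified Poisson model
print(poisson_fit.pearson_chi2 / poisson_fit.df_resid)  # well above 1 here

# A negative binomial model relaxes the mean = variance assumption
nb_fit = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print(nb_fit.params)
```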

Estimation:

  • worry too much about the assumed distribution if your primary goal is interpretation
  • assume a bootstrap is good enough for inference (see the sketch after this list)
  • be overly concerned with which optimization technique you choose
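
To show why a bootstrap isn’t automatically good enough: the naive percentile bootstrap behaves well for smooth statistics like the mean, but fails for statistics like the maximum, since resamples can never exceed the observed max. A small sketch (simulated data, arbitrary settings):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=200)

def percentile_boot_ci(data, stat, n_boot=2000, alpha=0.05):
    """Naive percentile bootstrap confidence interval."""
    boots = np.array([
        stat(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

# Reasonable for a smooth statistic like the mean...
print("mean CI:", percentile_boot_ci(x, np.mean))

# ...but degenerate for the maximum: the upper end of the interval
# can never exceed the observed maximum
print("max CI:", percentile_boot_ci(x, np.max))
```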

Data:

  • forget to transform your features
  • use simple imputation (like the mean or modal category) when you’re missing a lot of data (see the sketch after this list)
  • make something binary when it naturally isn’t and has notable variability
  • forget that large data can get small very quickly due to class imbalance, interactions, etc.
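
On the imputation point, here’s a sketch with scikit-learn on simulated correlated data (40% missingness; all values arbitrary). Mean imputation ignores the relationship between features and shrinks the imputed column’s variance, while a model-based imputer at least uses the other features. Even the latter is a single imputation, so it still understates uncertainty.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
X_miss = X.copy()
X_miss[rng.random(500) < 0.4, 1] = np.nan  # 40% missing in column 1

# Mean imputation: every missing value becomes the same number
X_mean = SimpleImputer(strategy="mean").fit_transform(X_miss)

# Model-based imputation: predicts the missing column from the other
X_iter = IterativeImputer(random_state=0).fit_transform(X_miss)

for name, data in [("true", X), ("mean", X_mean), ("iterative", X_iter)]:
    print(name, "var:", data[:, 1].var().round(2))
```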

ML:

  • use a single model without any baseline
  • start with a complicated model
  • use a simple 0.5 cutoff for binary classification
  • binarize your target just so you can do classification
  • use a single metric for model assessment
  • ignore uncertainty in your predictions or metrics
  • compare models on different datasets
  • let training data leak into the test set (see the pipeline sketch after this list)
  • assume variable importance is telling you what you think it is
  • assume that because your approach dropped a feature, it has no effect on the target
  • use a model just because it seems popular
  • use a model just because it’s the only one you know
  • use older black box methods that don’t keep up with more modern techniques
  • forget to scale your data
  • forget that the data is more important than your modeling technique
  • forget you need to be able to explain your model to someone else (goes for your code also)
  • assume that grid search is good enough for all or even most cases
  • ignore temporal/spatial data structure
  • think deep learning will solve all your problems
  • use SHAP for feature importance (it wasn’t meant for this and can be very misleading)
  • use rank orderings of predictions without understanding the increased uncertainty
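
To make the leakage and scaling points concrete: preprocessing should live inside a pipeline so that each cross-validation fold is scaled using only its own training data, rather than statistics computed from the full dataset. A minimal scikit-learn sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky: StandardScaler().fit_transform(X) before splitting lets the
# test folds influence the scaling. Instead, fit the scaler per fold:
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean().round(3))
```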

Causal:

  • pretend a modeling technique can prove a causal relationship
  • think random assignment can prove a causal relationship
  • ignore confounding variables (see the simulation after this list)
  • ignore the possibility of reverse causality
  • ignore the possibility of selection bias
  • ignore measurement error
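
Finally, a small simulation of confounding (the coefficients are arbitrary). The treatment has no causal effect on the outcome at all, yet a naive regression finds a substantial one; adjusting for the confounder removes it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 5000
confounder = rng.normal(size=n)
treatment = 0.8 * confounder + rng.normal(size=n)  # caused by confounder
outcome = 1.5 * confounder + rng.normal(size=n)    # treatment plays no role

# Naive model: a large, entirely spurious "treatment effect"
naive = LinearRegression().fit(treatment.reshape(-1, 1), outcome)
print("naive effect:", naive.coef_[0].round(2))

# Adjusting for the confounder drives the effect to ~0
adjusted = LinearRegression().fit(
    np.column_stack([treatment, confounder]), outcome
)
print("adjusted effect:", adjusted.coef_[0].round(2))
```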

11.2 Perfect is not Possible

Your model will never be perfect. Time, money, and other constraints will conspire to make sure of that, and you will always have to make trade-offs. You’re also human, and mistakes will be made. The best you can do is be aware of these limitations and work to mitigate them. Modeling is about exploration and discovery, and the best models are the ones that are the most useful, not the most perfect.