11  Danger Zone

The following are very brief thoughts on common pitfalls in modeling that occurred to us while writing this book. The list is not meant to be comprehensive, but rather a quick reference for things you’ll probably want to avoid. We’re not going to tell you never to do these things, but we are going to tell you to be very careful if you do, and to be ready to defend your choices.

11.1 Things to Generally Avoid

LM:

  • use stepwise/best subset regression
  • focus on statistical significance while ignoring predictive or practical results
  • forget to rely heavily on domain knowledge
  • forget to rely heavily on visualizations
  • use statistical tests in lieu of visualizations or practical metrics to check assumptions
  • compare models with R² (see the sketch after this list)
  • drop/modify extreme values
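
On the R² point: in-sample R² can only increase as you add features, so it happily rewards noise. Here’s a minimal sketch with simulated data and scikit-learn (the sample size and feature counts are arbitrary) showing how in-sample R² and cross-validated error can tell different stories:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500)

# Tack on 20 pure-noise features
X_noisy = np.hstack([X, rng.normal(size=(500, 20))])

for data, label in [(X, "original"), (X_noisy, "plus noise features")]:
    # In-sample R² never goes down when features are added...
    r2 = LinearRegression().fit(data, y).score(data, y)
    # ...but held-out error shows the noise features don't help
    rmse = -cross_val_score(
        LinearRegression(), data, y, cv=5,
        scoring="neg_root_mean_squared_error"
    ).mean()
    print(f"{label}: in-sample R2 = {r2:.3f}, CV RMSE = {rmse:.3f}")
```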

GLM:

  • use pseudo-R² for any important conclusion; these aren’t very useful and will never really be equivalent to the R² you get from OLS
  • treat accuracy as the primary metric for your logistic regression
  • assume GLMs are enough for your problem; for example, Poisson’s assumption that the mean equals the variance is very restrictive and typically doesn’t hold (see the sketch after this list)
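
As a quick check on the Poisson point, the ratio of the Pearson χ² statistic to the residual degrees of freedom should be near 1 when the mean = variance assumption holds. A sketch using statsmodels on simulated overdispersed counts (the coefficients and settings are arbitrary):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.3 * x)
# Negative binomial counts: variance exceeds the mean (overdispersion)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Dispersion estimate: should be ~1 for a well-specified Poisson model
print(poisson_fit.pearson_chi2 / poisson_fit.df_resid)  # well above 1 here

# A negative binomial model relaxes the mean = variance assumption
nb_fit = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print(nb_fit.params)
```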

Estimation:

  • worry too much about the assumed distribution if your primary goal is interpretation
  • assume a bootstrap is good enough for inference (see the sketch after this list)
  • be overly concerned with which optimization technique you choose
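
To show why a bootstrap isn’t automatically good enough: the naive percentile bootstrap behaves well for smooth statistics like the mean, but fails for statistics like the maximum, since resamples can never exceed the observed max. A small sketch (simulated data, arbitrary settings):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=200)

def percentile_boot_ci(data, stat, n_boot=2000, alpha=0.05):
    """Naive percentile bootstrap confidence interval."""
    boots = np.array([
        stat(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

# Reasonable for a smooth statistic like the mean...
print("mean CI:", percentile_boot_ci(x, np.mean))

# ...but degenerate for the maximum: the upper end of the interval
# can never exceed the observed maximum
print("max CI:", percentile_boot_ci(x, np.max))
```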

Data:

  • forget to transform your features
  • use simple imputation (like the mean or modal category) when you’re missing a lot of data (see the sketch after this list)
  • make something binary when it naturally isn’t and has notable variability
  • forget that large data can get small very quickly due to class imbalance, interactions, etc.
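
On the imputation point, here’s a sketch with scikit-learn on simulated correlated data (40% missingness; all values arbitrary). Mean imputation ignores the relationship between features and shrinks the imputed column’s variance, while a model-based imputer at least uses the other features. Even the latter is a single imputation, so it still understates uncertainty.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
X_miss = X.copy()
X_miss[rng.random(500) < 0.4, 1] = np.nan  # 40% missing in column 1

# Mean imputation: every missing value becomes the same number
X_mean = SimpleImputer(strategy="mean").fit_transform(X_miss)

# Model-based imputation: predicts the missing column from the other
X_iter = IterativeImputer(random_state=0).fit_transform(X_miss)

for name, data in [("true", X), ("mean", X_mean), ("iterative", X_iter)]:
    print(name, "var:", data[:, 1].var().round(2))
```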

ML:

  • use a single model without any baseline
  • start with a complicated model
  • use a simple 0.5 cutoff for binary classification
  • binarize your target just so you can do classification
  • use a single metric for model assessment
  • ignore uncertainty in your predictions or metrics
  • compare models on different datasets
  • let training data leak into the test set (see the pipeline sketch after this list)
  • assume variable importance is telling you what you think it is
  • assume that because your approach dropped a feature, it has no effect on the target
  • use a model just because it seems popular
  • use a model just because it’s the only one you know
  • use older black box methods that don’t keep up with more modern techniques
  • forget to scale your data
  • forget that the data is more important than your modeling technique
  • forget you need to be able to explain your model to someone else (goes for your code also)
  • assume that grid search is good enough for all or even most cases
  • ignore temporal/spatial data structure
  • think deep learning will solve all your problems
  • use SHAP for feature importance (it wasn’t meant for this and can be very misleading)
  • use rank orderings of predictions without understanding the increased uncertainty
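
To make the leakage and scaling points concrete: preprocessing should live inside a pipeline so that each cross-validation fold is scaled using only its own training data, rather than statistics computed from the full dataset. A minimal scikit-learn sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky: StandardScaler().fit_transform(X) before splitting lets the
# test folds influence the scaling. Instead, fit the scaler per fold:
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean().round(3))
```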

Causal:

  • pretend a modeling technique can prove a causal relationship
  • think random assignment can prove a causal relationship
  • ignore confounding variables (see the simulation after this list)
  • ignore the possibility of reverse causality
  • ignore the possibility of selection bias
  • ignore measurement error
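
Finally, a small simulation of confounding (the coefficients are arbitrary). The treatment has no causal effect on the outcome at all, yet a naive regression finds a substantial one; adjusting for the confounder removes it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 5000
confounder = rng.normal(size=n)
treatment = 0.8 * confounder + rng.normal(size=n)  # caused by confounder
outcome = 1.5 * confounder + rng.normal(size=n)    # treatment plays no role

# Naive model: a large, entirely spurious "treatment effect"
naive = LinearRegression().fit(treatment.reshape(-1, 1), outcome)
print("naive effect:", naive.coef_[0].round(2))

# Adjusting for the confounder drives the effect to ~0
adjusted = LinearRegression().fit(
    np.column_stack([treatment, confounder]), outcome
)
print("adjusted effect:", adjusted.coef_[0].round(2))
```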

11.2 Perfect is not Possible

Your model will never be perfect. Time, money, and other constraints will conspire to make sure of that, and you will always have to make trade-offs. You’re also human, and mistakes will be made. The best you can do is be aware of these limitations and work to mitigate them. Modeling is about exploration and discovery, and the best models are the ones that are the most useful, not the most perfect.