15  Danger Zone

You can usually anticipate and enumerate most of the ways your model will fail to work in advance. Yet the problems you’ll encounter in practice are usually exactly one of those things you knew to watch out for, but failed to.

~ Andrej Karpathy (supposedly)

When it comes to building models in data science, a lot can go wrong, and in many cases it’s easy to get lost in the weeds and lose sight of the bigger picture. Throughout the book, we’ve covered many instances in which caution is warranted in the modeling approach. In this chapter, we’ll more explicitly discuss some common pitfalls that can sneak up on you when you’re working on a data science project, and others that just came to mind while we were thinking about it. The topics are based on things we’ve commonly seen in consulting across many academic disciplines and industries, and here we attempt to provide a very general overview. That said, it is by no means exhaustive, and you may come across additional issues in your situation. The sections that follow are grouped to mirror the order in which the book presented its content.

15.1 Linear Models & Related Statistical Endeavors

Statistical models are a powerful tool for understanding the structure and meaning in your data. They are also excellent at helping us to understand the uncertainty in our data and the aspects of the model we wish to estimate. However, there are many ways in which problems can arise with statistical models.

15.1.1 Statistical significance

One of the most common mistakes when fitting statistical linear models is relying too heavily on the statistical result. Statistical significance is not enough to determine feature importance or model performance. When complex statistical models are applied to small data, the results are typically very noisy and statistical significance can be misleading. This also means that ‘big’ effects can be a reflection of that noise, rather than something meaningful.

Focusing on statistical significance can lead you down other dangerous paths. For example, relying on statistical tests of assumptions instead of visualizations or practical metrics can lead you to believe that your model is valid when it is not. Using a statistical testing approach to select features can often result in incorrect choices about feature contributions, as well as poorer models.

A related issue is p-hacking, which occurs when you try many different models, features, or other aspects of the model until you find one that is statistically significant. This is a problem because it can reflect spurious results, and make it difficult to generalize the results of the model (overfitting). It also means you ignored null results, which can be just as informative as significant ones, a problem known as the file drawer problem.
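To make the p-hacking point concrete, here is a minimal simulation sketch. The sizes (200 datasets, 100 observations, 50 candidate features) are our own assumptions for illustration, and the cutoff of |r| > 0.197 is the approximate two-sided p < .05 threshold for a correlation with 100 observations. Every feature is pure noise, yet searching across features almost always turns up something 'significant'.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes, chosen only for illustration
n_sims, n_obs, n_features = 200, 100, 50

# For n = 100, |r| > ~0.197 corresponds to p < .05 (two-sided test of r = 0)
r_crit = 0.197

found_something = []
for _ in range(n_sims):
    # Features and target are independent noise: every true correlation is 0
    X = rng.normal(size=(n_obs, n_features))
    y = rng.normal(size=n_obs)

    # Pearson correlation of each feature with the target
    X_z = (X - X.mean(axis=0)) / X.std(axis=0)
    y_z = (y - y.mean()) / y.std()
    r = X_z.T @ y_z / n_obs

    # Did the search across features turn up at least one 'significant' result?
    found_something.append(bool(np.any(np.abs(r) > r_crit)))

share_with_hit = float(np.mean(found_something))
print(f"Share of pure-noise datasets with a 'significant' feature: {share_with_hit:.2f}")
```

With 50 roughly independent tests at the .05 level, we would expect about 1 - .95^50, or roughly 92%, of these noise-only datasets to produce at least one 'finding'.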

15.1.2 Ignoring complexity

While techniques like standard linear/logistic regression and GLMs are valid and very useful, for many modeling contexts they may be too simple to capture the complexity of the data generating process, a form of underfitting. On the other side of the coin, many applications of statistical models ignore model assessment on a separate dataset, which can lead to overfitting. This makes generalization of such results more problematic. Those applications typically use a single model as well, and so may not be indicative of the best approach that could be taken. It’d be better to have a few models of varying complexity to explore.
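As a rough sketch of what 'a few models of varying complexity' assessed on separate data might look like, the following fits polynomials of degree 1, 3, and 9 to simulated data and reports error on both the training data and a held-out set. The data generating process (a sine curve plus noise) and the chosen degrees are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Hypothetical nonlinear data generating process
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(100)
x_test, y_test = make_data(1000)

results = {}
for degree in (1, 3, 9):
    # Polynomial regression of increasing flexibility
    coefs = np.polyfit(x_train, y_train, deg=degree)
    mse_train = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    results[degree] = (mse_train, mse_test)
    print(f"degree {degree}: train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}")
```

Training error can only go down as the polynomial degree increases, so it says little on its own. The held-out error is what tells us whether the straight line is too simple, or whether the extra flexibility is starting to chase noise.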

15.1.3 Using outdated techniques

If you wanted to go on a road trip, would you prefer a 1973 Ford Pinto or a Tesla Model S? If you want to browse the web, would you prefer to use a computer from the 90s and 56k modem, or a modern laptop with a high-speed internet connection? In both cases, you could potentially get to your destination or browse the web, but the experience would be much different, and you would likely have a clear preference1. The same goes with the models you use for your data analysis.

This is not specific to the statistical linear modeling realm, but there are many applications of statistical models that rely on outdated techniques, metrics, or other tools that solve problems that don’t exist anymore. For example, using stepwise/best subset regression for feature selection is not really viable when more principled approaches like the lasso are available. Likewise, we can’t really think of a case where something like MANOVA/discriminant function analysis would provide the best answer to a data problem, or where a pseudo-R2 metric would help us understand a model better or make a decision about it.

Statistical analysis has been around a long time, and many of the techniques that have been developed are still valid, useful, and very powerful. But some reflect the limitations of the time in which they were developed. Others were an attempt to take something that was straightforward for simpler settings (e.g., linear regression) and apply to settings where it doesn’t make sense (nonlinear, non-gaussian, etc.). Even when still valid, there may be better alternatives available now.

15.1.4 Simpler is not necessarily more interpretable

Standard linear models are often used because of their interpretability, but in many of these modeling situations, interpretability can be difficult to obtain without using the same amount of effort one would for more complex models. Many statistical/linear models employ interactions, or nonlinear feature-target relationships (e.g., GLM/GAMs). If your goal is interpretability, these settings can be as difficult to interpret as features in a random forest. They still have the added benefit of more reliable uncertainty estimation. But you should not assume you will have a result as simple as a coefficient in a linear regression just because you didn’t use a deep learning model.

15.1.5 Model comparison

When comparing models, especially in the statistical modeling realm, many will use a statistical test to compare them. An example would be using an ANOVA or likelihood ratio test to compare a model with and without interactions. Unfortunately this doesn’t actually tell us how the models perform under realistic settings, and it comes with the usual statistical significance issues, like using an arbitrary threshold for claiming significance. You could basically claim that one terrible model is statistically better than another terrible model, but there isn’t much value in that.

Some like to look at R2 to compare models2, but it has a lot of problems. People think it’s more interpretable than other options, yet there is no value of ‘good’ you can universally apply, even in very similar scenarios. It can arbitrarily increase with the number of features whether they are actually predictive or not, and it doesn’t tell you how well the model will perform on new data. It can also simply reflect that you have time-series data, as you are just witnessing spurious correlations over time. In short, you can use it to get a sense of how your predictions correlate with the target, but that can be a fairly limited assessment.

The following plot shows 250 simulations with a sample size of 100 and 40 completely meaningless features used in a linear regression. The R^2 values would all suggest the model is somewhat useful, with an average of ~.4. The adjusted R^2 values average zero, which is correct, but they can only average that by sometimes being negative, which is a meaningless value. Many of the adjusted values still get into areas that would be viable for some domains.

The problem of R2
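A simulation along the lines described can be sketched as follows. This is not necessarily the code used to produce the plot; it only assumes the same setup of 250 simulations, 100 observations, and 40 noise features, with the standard formulas for R^2 and adjusted R^2.

```python
import numpy as np

rng = np.random.default_rng(123)

n_sims, n_obs, n_features = 250, 100, 40
r2 = np.empty(n_sims)
adj_r2 = np.empty(n_sims)

for s in range(n_sims):
    # Target and features are unrelated noise
    X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_features))])
    y = rng.normal(size=n_obs)

    # Ordinary least squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2[s] = 1 - ss_res / ss_tot
    adj_r2[s] = 1 - (1 - r2[s]) * (n_obs - 1) / (n_obs - n_features - 1)

print(f"Mean R^2:          {r2.mean():.2f}")
print(f"Mean adjusted R^2: {adj_r2.mean():.2f}")
print(f"Share of negative adjusted R^2 values: {(adj_r2 < 0).mean():.2f}")
```

With no real signal, the expected R^2 is the number of features divided by n - 1, here 40/99, which is where the average of roughly .4 comes from.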

Other commonly used metrics, like AIC, might be better in theory for model comparison. But they approximate the model selection one would get through cross-validation, so why not just do the cross-validation as due diligence? Furthermore, as long as you are using those metrics only on the training data, you probably aren’t getting a good idea of how the model will generalize (Section 10.4).
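To show that cross-validation takes little effort, here is a hand-rolled sketch that compares two linear regressions with both 5-fold cross-validation and AIC. The data, the 40 noise features, and the helper names (`fit_ols`, `kfold_mse`, `gaussian_aic`) are all hypothetical, and the AIC is the usual Gaussian form up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical data: two real features plus 40 pure-noise features
n, n_noise, k = 200, 40, 5
X_real = rng.normal(size=(n, 2))
X_noise = rng.normal(size=(n, n_noise))
y = 1.0 + 2.0 * X_real[:, 0] - 1.0 * X_real[:, 1] + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), X_real])
X_big = np.column_stack([np.ones(n), X_real, X_noise])

def fit_ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def kfold_mse(X, y, fold_ids):
    """Mean squared error using only out-of-fold predictions."""
    errors = np.empty(len(y))
    for fold in np.unique(fold_ids):
        held_out = fold_ids == fold
        beta = fit_ols(X[~held_out], y[~held_out])
        errors[held_out] = (y[held_out] - X[held_out] @ beta) ** 2
    return errors.mean()

def gaussian_aic(X, y):
    """AIC for a Gaussian linear model, up to an additive constant."""
    rss = np.sum((y - X @ fit_ols(X, y)) ** 2)
    n_params = X.shape[1] + 1  # coefficients plus the error variance
    return len(y) * np.log(rss / len(y)) + 2 * n_params

# Use the same random fold assignment for both models
fold_ids = rng.permutation(n) % k

cv_small = kfold_mse(X_small, y, fold_ids)
cv_big = kfold_mse(X_big, y, fold_ids)

print(f"small model: CV MSE = {cv_small:.3f}, AIC = {gaussian_aic(X_small, y):.1f}")
print(f"big model:   CV MSE = {cv_big:.3f}, AIC = {gaussian_aic(X_big, y):.1f}")
```

The cross-validated error has the added benefit of being on the scale of the target, so it says something about how well the model predicts rather than only which model is preferred.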

A common issue in statistical and machine learning modeling is the garden of forking paths. This is the idea that there are many different ways to analyze a dataset, and that the results of these analyses can be very different. When you don’t have a lot of data, or when the data is complex and the data generating process is not well understood, there can be a lot of forks that lead to many different models with varying results. In these cases, the interpretation of a single model from the many that are actually employed can be misleading, and can lead to incorrect conclusions about the data.

15.2 Estimation

15.2.1 What if I just tweak this…

From traditional statistical models to deep learning, the more you know about the underlying modeling process, the more apt you are to tweak some aspect of the model to try and improve performance. When you start thinking about changing optimizer options, link/activation functions, learning rates, etc., you can easily get lost in the weeds. This would be okay if you knew ahead of time it would make a big difference. However, in many, or maybe even most cases, this sort of tweaking doesn’t improve model results by much, or there are ways to not have to make the choice in the first place such as through hyperparameter tuning (Section 10.7). More to the point, if this sort of ‘by-hand’ parameter tweaking does make a notable difference, that may suggest that you have a bigger problem with your model architecture or data.

For many tools, a lot of work has been done for you by folks who had a lot more time to work on these aspects of the model, and who will attempt to provide ‘sensible defaults’ which can work pretty well. There is still plenty we need to explore, and maybe a lot with more complex models such as boosting or deep learning. Even so, when you’ve appropriately tuned over the parameters that need it, you’ll often find the results are not that different from what are otherwise notably different parameter settings.

15.2.2 Everything is fine

There is a flip side to the previous point, and that is that many assume that the default settings for complex models are good enough. We all do this when venturing into the unknown, but we do so at our own risk. Many of the more complex models have defaults geared toward a ‘just works’ setting rather than a ‘production’ setting. For example, the default number of boosting rounds for xgboost will rarely be adequate3. Again, an appropriately tuned model should cover your bases.

15.2.3 Just bootstrap it!

When it comes to uncertainty estimation, many common modeling tools leave that to the user, and when the developers are pressed on how to get uncertainty estimates, they will often suggest to just bootstrap the result. While the bootstrap is a powerful tool for inference, it isn’t appropriate just because you decide to use it. The suggestion to use bootstrapping is often made in the context of a complex modeling situation where it would be very (prohibitively) computationally expensive, and in other cases the properties of the results are not well understood. Other methods of prediction inference, such as conformal prediction, may be better suited to the task. In general, if a package developer suggests you bootstrap because their package doesn’t have any means of uncertainty estimation, you should be cautious. If it’s the obvious option, it should be included in the package.

While we’re at it, another common suggestion is to use a quantile regression (Section 9.5) approach to get prediction intervals. This is a valid option in some cases, but it’s not clear how appropriate it is for complex models or for certain types of outcomes, and modeling tools for predicting quantiles are not typically available for a given model implementation.
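As an illustration of the conformal prediction alternative mentioned above, here is a minimal sketch of split conformal prediction for a simple regression. The simulated data, the straight-line model, and the 90% target coverage are assumptions for the example. The approach only requires a model fit on one part of the data and a separate calibration set of residuals.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate(n):
    # Hypothetical data generating process
    x = rng.uniform(0, 10, size=n)
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
    return x, y

x_fit, y_fit = simulate(500)    # used to fit the model
x_cal, y_cal = simulate(500)    # used only to calibrate the interval width
x_new, y_new = simulate(2000)   # stands in for future data

# Any model could be used here; a straight line keeps the sketch short
slope, intercept = np.polyfit(x_fit, y_fit, deg=1)

def predict(x):
    return intercept + slope * x

# Conformity scores: absolute residuals on the calibration set
alpha = 0.10
scores = np.abs(y_cal - predict(x_cal))
n_cal = len(scores)

# Finite-sample corrected quantile of the scores
rank = int(np.ceil((n_cal + 1) * (1 - alpha)))
q_hat = np.sort(scores)[rank - 1]

lower = predict(x_new) - q_hat
upper = predict(x_new) + q_hat
coverage = float(np.mean((y_new >= lower) & (y_new <= upper)))

print(f"Interval half-width: {q_hat:.2f}")
print(f"Empirical coverage on new data: {coverage:.3f}")
```

This version produces intervals of constant width, which is a limitation when the noise varies with the features, but it needs only a single model fit rather than the many refits a bootstrap would require.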

15.3 Machine Learning

15.3.1 General ML modeling issues

We see a lot of issues with machine learning approaches, and many of them are the same as those that come up with statistical models, but some are more specific to the machine learning world. A starting point is that many forget to create a baseline model, and instead jump right into a complicated model. This is a problem because it is hard to improve performance if you don’t know what a good baseline score is. So create that baseline model and iterate from there.
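A baseline can be as simple as predicting the training mean for every observation. The sketch below, which uses simulated data and coefficients we made up for illustration, compares that baseline to a linear regression on a held-out set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three features, one of which is irrelevant
n, n_train = 1000, 800
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=n)

train = slice(0, n_train)
test = slice(n_train, n)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Baseline: always predict the mean of the training target
baseline_pred = np.full(n - n_train, y[train].mean())
rmse_baseline = rmse(y[test], baseline_pred)

# Candidate model: linear regression with an intercept
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
rmse_linear = rmse(y[test], X1[test] @ beta)

print(f"Baseline RMSE: {rmse_baseline:.3f}")
print(f"Linear RMSE:   {rmse_linear:.3f}")
```

Any more complicated model then has two numbers to beat, and the size of the improvement over each tells you whether the added complexity is paying for itself.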

A related point is that many will jump into machine learning without fully investigating the data. Standard exploratory data analysis (EDA) is a prerequisite for any modeling, and can go a long way toward saving time and effort in the modeling process. It’s here you’ll find problematic cases and features, and can explore ways to deal with them.

When choosing a model or set of models, one should have a valid reason for the choice. Some less stellar reasons include using a model just because it seems popular in machine learning. And as mentioned with other types of models, you want to avoid using older methods that really don’t perform well in most situations compared to others4.

15.3.2 Classification

Machine learning is not synonymous with a classification problem, but this point seems to be lost on many. As an example, many will split their target just so they can do classification, when the target is a more expressive continuous variable. This is a problem because you are unnecessarily diminishing the reliability of the target score, and losing information about it. This can lead to a well known statistical issue - attenuation of the correlation between variables.

import numpy as np
import pandas as pd

def simulate_binarize(
    N = 1000,
    correlation = .5,
    num_simulations = 100,
    bin_y_only = False
):
    correlations = []
    
    for i in range(num_simulations):
        # Simulate two variables with the given correlation
        xy = np.random.multivariate_normal(
            mean = [0, 0], 
            cov = [[1, correlation], [correlation, 1]], 
            size = N
        )

        # binarize on median split
        if bin_y_only:
            x_bin = xy[:, 0]
        else:
            x_bin = np.where(xy[:, 0] >= np.median(xy[:, 0]), 1, 0)
        y_bin = np.where(xy[:, 1] >= np.median(xy[:, 1]), 1, 0)
        
        raw_correlation = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
        binarized_correlation = np.corrcoef(x_bin, y_bin)[0, 1]
        
        correlations.append({
            'sim': i,
            'raw_correlation': raw_correlation,
            'binarized_correlation': binarized_correlation
        })

    cors = pd.DataFrame(correlations)
    return cors

simulate_binarize(correlation = .25, num_simulations = 5)

# R version of the same simulation
library(tibble)  # tibble()
library(dplyr)   # bind_rows()

simulate_binarize = function(
    N = 1000,
    correlation = .5,
    num_simulations = 100,
    bin_y_only = FALSE
) {
    correlations = list()
    
    for (i in 1:num_simulations) {
        # Simulate two variables with the given correlation
        
        xy = MASS::mvrnorm(
            n = N, 
            mu = c(0, 0), 
            Sigma = matrix(
                c(1, correlation, correlation, 1),
                nrow = 2
            ),
            empirical = FALSE
        )

        # binarize on median split
        
        if (bin_y_only) {
            x_bin = xy[, 1]
        } else {
            x_bin = ifelse(xy[, 1] >= median(xy[, 1]), 1, 0)
        }
        
        y_bin = ifelse(xy[, 2] >= median(xy[, 2]), 1, 0)
        
        raw_correlation = cor(xy[, 1], xy[, 2])
        binarized_correlation = cor(x_bin, y_bin)
        
        correlations[[i]] = tibble(
            sim = i,
            raw_correlation,
            binarized_correlation
        )
    }

    cors =  bind_rows(correlations)
    cors
}

simulate_binarize(correlation = .25, num_simulations = 5)

The following plot shows the case where we binarize only the target variable, for 500 simulations. The true correlation between the two raw variables is .25, .5, or .75, but the correlation after binarization is notably lower. This is because binarization discards information about the target: every value on the same side of the median is treated as identical, which attenuates the observed correlation.

Figure 15.1: Density plots of raw and binarized correlations
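For the bivariate normal setup used in the simulation, the amount of attenuation from a median split has a known closed form, so we can sketch a check of the simulated values against theory. Splitting one variable multiplies the correlation by sqrt(2/pi), about 0.8, and splitting both gives (2/pi) * arcsin(rho). The single large sample size here is our own choice to keep simulation noise small.

```python
import numpy as np

rng = np.random.default_rng(99)
n = 200_000  # one large sample, so simulation noise is small

results = {}
for rho in (0.25, 0.5, 0.75):
    xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    x, y = xy[:, 0], xy[:, 1]

    # Median splits
    x_bin = (x >= np.median(x)).astype(float)
    y_bin = (y >= np.median(y)).astype(float)

    sim_y_only = np.corrcoef(x, y_bin)[0, 1]
    sim_both = np.corrcoef(x_bin, y_bin)[0, 1]

    # Closed-form values for bivariate normal data
    theory_y_only = rho * np.sqrt(2 / np.pi)
    theory_both = 2 / np.pi * np.arcsin(rho)

    results[rho] = (sim_y_only, theory_y_only, sim_both, theory_both)
    print(
        f"rho = {rho:.2f} | target only: {sim_y_only:.3f} "
        f"(theory {theory_y_only:.3f}) | both: {sim_both:.3f} "
        f"(theory {theory_both:.3f})"
    )
```

So even in this idealized case, binarizing the target alone costs about 20% of the correlation, and binarizing both variables costs more.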

Common issues with ML classification don’t end here, however. Another problem is that many will use a simple .5 cutoff for binary classification, when it is probably not the best choice in most classification settings. Related to this, many focus only on accuracy as a metric for performance. Other metrics are more useful in many situations, or just add more information to assess the model. Each metric has its own pros and cons, so you should evaluate your model’s performance with a suite of metrics.
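The following sketch shows how much the picture can change with the cutoff and the metric. The simulated outcome is imbalanced, and for simplicity we treat the true probabilities as if they were a model's predicted probabilities, which is an assumption that flatters the 'model'. The alternative cutoff of .15 is arbitrary and only there for contrast.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical imbalanced outcome driven by a single feature
n = 5000
x = rng.normal(size=n)
prob = 1 / (1 + np.exp(-(-2.5 + 1.5 * x)))  # stand-in for predicted probabilities
y = rng.binomial(1, prob)

def classification_metrics(y_true, prob, threshold):
    """Accuracy, precision, and recall at a given probability cutoff."""
    pred = (prob >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    return {
        "threshold": threshold,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }

at_half = classification_metrics(y, prob, 0.5)
at_low = classification_metrics(y, prob, 0.15)

for m in (at_half, at_low):
    print(
        f"cutoff {m['threshold']:.2f}: accuracy = {m['accuracy']:.3f}, "
        f"precision = {m['precision']:.3f}, recall = {m['recall']:.3f}"
    )
```

Lowering the cutoff can never reduce recall, but it typically costs precision and, with a rare outcome, accuracy as well. Which trade-off is acceptable depends on the cost of each kind of error, which is exactly why a single metric at a single cutoff is not enough.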

15.3.3 Ignoring uncertainty

It is very common in ML practice to ignore uncertainty in predictions or metrics. This is a problem because there is always uncertainty, and acknowledging that it exists can help one have better expectations of performance. This is especially true when you are using a model in a production setting, where the model’s performance can have real-world consequences.

It is often computationally difficult to get uncertainty estimates for many of the black-box techniques that are popular in ML. Some might suggest that there is enough data such that uncertainty is not needed, but this would have to be demonstrated in some fashion. Furthermore, there is always increased uncertainty for prediction on new data and for smaller subsets of the population we might be interested in. In general, there are ways to get uncertainty estimates for these models, e.g., bootstrapping, conformal prediction, and simulation, and it is often worth the effort to do so.
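For performance metrics, as opposed to individual predictions, a bootstrap of the test set is often cheap because no model refitting is needed. The sketch below uses made-up labels and predictions that agree about 80% of the time, and computes a percentile interval for accuracy.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical test set: predictions match the truth roughly 80% of the time
n_test = 300
y_true = rng.binomial(1, 0.3, size=n_test)
flip = rng.random(n_test) < 0.2
y_pred = np.where(flip, 1 - y_true, y_true)

correct = y_true == y_pred
accuracy = float(correct.mean())

# Resample test observations with replacement and recompute the metric
n_boot = 2000
boot_acc = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n_test, size=n_test)
    boot_acc[b] = correct[idx].mean()

ci_lower, ci_upper = np.percentile(boot_acc, [2.5, 97.5])
print(f"Accuracy: {accuracy:.3f} (95% interval: {ci_lower:.3f} to {ci_upper:.3f})")
```

With a few hundred test observations, the interval spans several percentage points, which is often larger than the difference between the competing models being compared.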

15.3.4 Hyperfocus on feature importance

Researchers and businesses often have questions about which features in an ML model are important. Yet this can be a difficult question to answer, and the answer is often not practically useful. For example, most models used in ML are going to have interactions, so the importance of any single feature is likely going to depend on other features in the model. If you can’t disentangle the effects of one feature from another, then trying to talk about a single feature’s relative worth is often a misguided endeavor, even if you use an importance metric that tries to account for the interaction.

Even if we can deem a variable ‘important’, this doesn’t imply a causal relationship, and it doesn’t mean that the variable is the best of the features you have. In addition, other metrics, which might be just as valid, may provide a different rank ordering of importance.

What’s more, just because an importance metric may deem a feature as not important, that doesn’t mean it has no effect on the target. It may be that the feature is correlated with other features that are more important, and so the metric is just reflecting that. It may also just mean that the importance metric is not well suited to assessing that particular feature’s contribution.

As we have seen (Section 5.9), the reality is that multiple valid measures of importance can come to different conclusions about the relative importance of a feature, even within the same model setting. One should be very cautious in how they interpret these.

SHAP values are meant to assess local, i.e., observation level, feature contributions to a prediction. They are also used as global features of importance in many ML contexts, even though they are not meant to be used this way. Doing so can be misleading, and often average SHAP values will just reflect the distribution of the feature more than its importance, and could be notably inconsistent with other metrics even in simple settings.

15.3.5 Other common pitfalls

A few other common pitfalls in ML modeling include:

  • Forgetting that the data is more important than your modeling technique. You will almost always get more mileage out of improving your data than you will out of improving your model.

  • Ignoring data leakage, i.e., letting information from the test set leak into training. As a simple example, consider if we use random validation splits with time-based data. This would allow the model to train with future data it will ultimately be assessed on. That may be an obvious example, but there are many more subtle ways this can happen. Data leakage gives your model an unfair advantage when it is time for testing, leading you to believe that your model is doing better than it really is.

  • Forgetting you will ultimately need to be able to explain your model to someone else. The only good model is a useful one; if you can’t explain it to someone, you can’t expect others to trust you with it or your results.

  • Assuming that grid search is good enough for all or even most cases. Not only is it computationally expensive, but you can easily miss valid tuning parameter values that are outside of the grid. Many other methods are available that more efficiently search the space and are as easy to implement.

  • Thinking deep learning will solve all your problems. If you are dealing with standard, tabular data, at present deep learning will often just increase computational complexity and time, with no guarantee of increased performance. Hopefully this will change in the future, but for now, you should not expect major performance gains.

  • Comparing models on different datasets. If you run different models on separate data, there is no objective way to compare them. As an example, the accuracy may be higher on one dataset just because the baseline rate is much higher.
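To illustrate the data leakage example in the list above, here is a sketch that compares a random split to a chronological split on a simulated random walk. The 'model' is a deliberately simple stand-in we made up for illustration: it predicts each validation point using the target value at the closest training time point, which mimics how a flexible model can exploit neighboring observations in time.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical time series: a random walk observed at times 0, 1, ..., n - 1
n, n_train = 1000, 800
t = np.arange(n)
y = np.cumsum(rng.normal(size=n))

def nearest_time_rmse(train_idx, test_idx):
    """RMSE when predicting each test point with the closest training point in time."""
    train_idx = np.sort(train_idx)
    pos = np.clip(np.searchsorted(train_idx, test_idx), 1, len(train_idx) - 1)
    left, right = train_idx[pos - 1], train_idx[pos]
    nearest = np.where(test_idx - left <= right - test_idx, left, right)
    return float(np.sqrt(np.mean((y[test_idx] - y[nearest]) ** 2)))

# Random split: validation points are surrounded by training points in time
shuffled = rng.permutation(n)
rmse_random = nearest_time_rmse(shuffled[:n_train], shuffled[n_train:])

# Chronological split: all validation points come after the training period
rmse_time = nearest_time_rmse(t[:n_train], t[n_train:])

print(f"RMSE with random split:        {rmse_random:.2f}")
print(f"RMSE with chronological split: {rmse_time:.2f}")
```

The random split looks far better only because the model gets to peek at values on either side of each validation point. The chronological split reflects what would actually happen when predicting the future.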

The list goes on. In short, many of the pitfalls in ML modeling are the same as those in statistical modeling, but there are some unique to or more common in the ML world. The most important thing to remember is that due diligence is key when conducting any modeling exercise, and ML doesn’t change that. You should always be able to explain and defend your model choices and results to someone else.

15.4 Causal Inference

Causal inference and modeling is hard. Very hard.

15.4.1 The hard work is done before data analysis

The most important part of causal modeling is the conceptualization of the problem and the general design of the study to answer the specific questions related to that problem. You have to think very hard about the available data, what variables may be confounders, which effects may be indirect, and many other aspects of the process you want to measure. A causal model is the one you draw up, possibly before you even start collecting data, and it is the one you use to guide your data collection and ultimately your data modeling process.

15.4.2 Models can’t prove causal relationships

Causal modeling focuses on addressing issues like confounding, selection bias, and measurement error, which can skew interpretations about cause and effect. While predictive accuracy is key in some scenarios, understanding these issues is crucial for making valid causal claims.

A common mistake in modeling is assuming that a model can prove causality. You can have a very performant model, but the model results cannot prove that one variable causes another just because it is highly predictive. There is also nothing in the estimation process that can magically extract a causal relationship even if it exists. Reality is even more complex than our models, and no model can account for every possibility. Causal modeling attempts to account for some of these issues, but it is limited by our own biases in thinking about a particular problem.

Predictive features in a model might reflect a true causal link, act as stand-ins for one, or merely reflect spurious associations. Conversely, true causal effects of a feature may not be large, but it doesn’t mean they are unimportant. Assuming you have done the hard work of developing the causal structure beforehand, model results can provide more confidence in your ultimate causal conclusions, and that is very useful, despite lingering uncertainties.

15.4.3 Random assignment is not enough

Many believe experimental design is the gold standard for making causal claims, and it is certainly a good way to control for various aspects that can make causal claims difficult. Consider a randomized controlled trial (RCT) where you assign people to a treatment or control group. In the figure that follows, the left panel shows the overall treatment effect, where the main effect would suggest a causal conclusion of no treatment effect. However, the right panel shows the same treatment effect across another group factor, and it is clear that the treatment effect is not the same across groups.

Figure 15.2: Main Effect vs. Interaction
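A situation like the one in the figure is easy to simulate. In this sketch, which uses effect sizes we made up for illustration, treatment is randomly assigned, but it helps one group and harms the other by the same amount, so the overall difference in means is close to zero.

```python
import numpy as np

rng = np.random.default_rng(21)

n = 2000
treatment = rng.integers(0, 2, size=n)  # random assignment
group = rng.integers(0, 2, size=n)      # some other grouping factor

# Hypothetical effects: +1 in group 1, -1 in group 0
effect = np.where(group == 1, 1.0, -1.0)
y = effect * treatment + rng.normal(size=n)

def diff_in_means(mask):
    """Treated minus control mean outcome among observations in mask."""
    treated = y[mask & (treatment == 1)].mean()
    control = y[mask & (treatment == 0)].mean()
    return float(treated - control)

overall_effect = diff_in_means(np.ones(n, dtype=bool))
effect_group0 = diff_in_means(group == 0)
effect_group1 = diff_in_means(group == 1)

print(f"Overall treatment effect: {overall_effect:.2f}")
print(f"Effect in group 0:        {effect_group0:.2f}")
print(f"Effect in group 1:        {effect_group1:.2f}")
```

Randomization did its job here, in that each estimate is unbiased for what it measures. The problem is that the average effect is not the quantity that describes what the treatment actually does.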

So random assignment cannot save us from misunderstanding the causal mechanisms at play. Other issues to think about are that the treatment may be implemented poorly, participants may not be compliant, or the treatment may not even be well defined, and these are not uncommon situations. This comes back to having the causal structure understood as best you can before any analysis.

15.4.4 Ignoring causal issues

Causal modeling is concerned with things like confounding, selection bias, measurement error, reverse causality and more. These are all issues that can lead to incorrect causal conclusions. A lot of this can be ignored when predictive performance is of primary importance, and some can be ignored when we are not interested in making causal claims. But when you are interested in making causal claims, you will have some work to do in order for your model to help you make said claims, regardless of the modeling technique you choose to implement. And it doesn’t hurt to be concerned about these issues in non-causal situations.

15.5 Data

When it comes to data, plenty can go wrong before even starting with any modeling attempt. Let’s take a look at some issues that can regularly arise.

15.5.1 Transformations

Many models will fail miserably without some sort of scaling or transformation of the data. A few techniques, like tree-based approaches, do not benefit, but practically all others do. At the very least, models will converge faster and possibly be more interpretable. However, you should generally not use transformations that would lose the expressivity of the data, because as we noted with binarization (Section 15.3.2), some can do more harm than good. But you should always consider the need for transformations, and not just assume that the data is in a form that is ready for modeling.

15.5.2 Measurement error

Measurement error is a common issue in data collection, and it can lead to biased estimates and reduce our ability to detect meaningful feature-target relationships. Generally speaking, the reliability of a feature or target is its ability to measure what it’s supposed to, while measurement error reflects its failure to do so. There is no perfectly measured variable, and measurement error can come from a variety of sources, and be difficult to assess. But it is important to try and understand how well your data reflects the constructs it is supposed to. If you can’t correct for it, for example, by finding better data, you should at least be aware of the issue and consider how it might affect your results. There is a saying about squeezing blood from a stone, or putting lipstick on a pig, or something like that, and it applies here. If your data is poor, your model won’t save it.

15.5.3 Simple imputation techniques

Imputation may be required when you have missing data, but it can be done in ways that don’t help your model. Simple imputation techniques, like using the mean or modal category, can produce faulty, or at best, noisier, results. First you should consider why you want to keep a feature that doesn’t have a lot of data - do you even trust the values that are present? If you really need to impute, use an actual model to do so, but recognize that the resulting value has uncertainty associated with it. There are practical problems with implementing techniques to incorporate the uncertainty (Section 14.3.4), so there is no free lunch there. But at least having a better imputation model will provide a better guess than a mean, and still better is to use a model that would handle the missing values natively, like tree-based methods that can split on the missingness.
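The difference between mean imputation and even a very simple model-based imputation can be sketched as follows. The data are simulated, the values are missing completely at random (an assumption that favors both approaches), and the imputation model is just a regression on one other feature.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data: x is related to another feature z, and 40% of x is missing
n = 5000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(scale=0.6, size=n)
missing = rng.random(n) < 0.4
x_obs = np.where(missing, np.nan, x)

# Mean imputation: every missing value gets the same number
x_mean_imp = np.where(missing, np.nanmean(x_obs), x_obs)

# Model-based imputation: regress x on z using the observed cases
slope, intercept = np.polyfit(z[~missing], x[~missing], deg=1)
x_model_imp = np.where(missing, intercept + slope * z, x_obs)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rmse_mean = rmse(x[missing], x_mean_imp[missing])
rmse_model = rmse(x[missing], x_model_imp[missing])

print(f"Imputation RMSE, mean:  {rmse_mean:.2f}")
print(f"Imputation RMSE, model: {rmse_model:.2f}")
print(f"Variance of x: true = {x.var():.2f}, after mean imputation = {x_mean_imp.var():.2f}")
```

Mean imputation shrinks the variance of the feature and ignores what the other features tell us. The model-based values are better guesses, but they are still guesses: filling them in as if they were observed understates the uncertainty, as noted above.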

15.5.4 Outliers are real!

One common practice in modeling is to drop or modify values considered “outliers”. However, extreme values in the target variable are often a natural part of the data. Assuming there is no actual error in recording them, a simple transformation can often address the issue. If extremes persist after modeling, it indicates that the model is unable to capture the underlying data structure, rather than an inherent problem with the data itself. Additionally, even values that may not appear extreme can still have large residuals, so it’s important not to focus solely on the most extreme observed values.

In terms of features, extreme values can cause strange effects, but often they reflect a data problem (e.g., incorrect values), or can be resolved using the transformations you should already be considering (e.g., taking the log). In other cases, they don’t really cause any modeling problems at all. And again, some techniques are fairly robust to feature extremes, like tree-based methods.

15.5.5 Big data isn’t always as big as you think

Consider a model setting with 100,000 samples. Is this large? Let’s say you have a rare outcome that occurs 1% of the time. This means you have 1000 samples where the outcome label you’re interested in is present. Now consider a categorical feature (A) that has four categories, and one of those categories is relatively small, say 5% of the data, or 5000 cases, and you want to interact it with another categorical feature (B), one whose four categories are all equally distributed. Assuming no particular correlation between the two, you’d be down to ~1% of the data for the smallest category of A within each level of B. Now if there is an actual interaction effect on the target, some of those interaction cells may have only a dozen or so positive target values. Odds are pretty good that you don’t have enough data to make a reliable estimate of that effect unless it is extremely large.

Oh wait, did you want to use cross-validation also? A simple random CV approach might result in some validation sets with no positive values in those interaction groups at all! Don’t forget that you may have already split your 100,000 samples into training and test sets, so you have even less data to start with! The following table shows the final cell count for a dataset with these properties.

Start N   Train N   A p    B p    5-fold CV p   Final Cell p   Cell N   Target N in Cell
100,000   80,000    0.05   0.25   0.20          0.0025         200      2

The point is that it’s easy to forget that large data can get small very quickly due to class imbalance, interactions, etc. There is not much you can do about this, but you should not be surprised when these situations are not very revealing in terms of your model results.
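The arithmetic behind the table is worth doing explicitly for your own data before modeling. This sketch reproduces the numbers from the table. The variable names are ours, and it assumes, as the example does, that the features are unrelated to each other and to the target rate.

```python
# Values taken from the example above
start_n = 100_000
train_share = 0.80      # 80,000 of the 100,000 samples used for training
p_a = 0.05              # share of data in the smallest category of A
p_b = 0.25              # share of data in one category of B
fold_share = 1 / 5      # share of training data in one validation fold (5-fold CV)
target_rate = 0.01      # rate of the rare outcome

train_n = start_n * train_share
cell_p = p_a * p_b * fold_share
cell_n = train_n * cell_p
target_n_in_cell = cell_n * target_rate

print(f"Training samples:              {train_n:,.0f}")
print(f"Share of data in the cell:     {cell_p:.4f}")
print(f"Samples in the cell:           {cell_n:,.0f}")
print(f"Expected positives in cell:    {target_n_in_cell:.0f}")
```

An expected count of two means that, with random splitting, some validation folds will have no positive cases in that cell at all.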

15.6 Wrapping Up

Though we’ve covered many common issues in modeling here, there are plenty more ways we can trip ourselves up. The important thing to remember is that we’re all prone to making and repeating mistakes in modeling. But awareness and effort can go a long way, and we can more easily avoid these problems with practice. The main thing is to try and do better each time, and learn from any mistakes you do make.

15.6.1 The common thread

Many of the issues here are model agnostic and could creep into any modeling exercise you undertake.

15.6.2 Choose your own adventure

If you’ve made it through the previous chapters, there’s only one place to go. But you might revisit some of those in light of the common problems we’ve discussed here.

15.6.3 Additional resources

Mostly we recommend the same resources we did in the corresponding sections of the previous chapters. However, a couple others to consider are:

  • Shalizi (2015) (start with the fantastic concluding comment)
  • Questionable Practices in Machine Learning (Leech et al. 2024)

  1. Granted, if it was a Pinto wagon, the choice could be more difficult.

  2. Adjusted R2 doesn’t help; the same issues are present, and it would not be practically different from R2 except in very small data situations, where it might even be negative!

  3. The number is actually dependent on other parameters, like whether early stopping is used, the number of classes, etc.

  4. As we mentioned in the statistical section, many older methods are still valid and useful. But it’s not clear what would be gained by using things like a basic support vector machine or knn-regression relative to more recently developed or other techniques that have shown more flexibility.