All those causal effects will be lost in time, like tears in rain… without adequate counterfactual considerations.
Roy Batty (paraphrased)
Causal inference is an important topic in machine learning and statistics, and it is also a difficult one to understand well, or consistently, because not everyone agrees on how to define a cause in the first place. Our focus here is merely practical: we just want to show some of the common modeling approaches used when attempting to answer causal questions. Causal modeling in general is such a rabbit hole that we won’t be able to go into much detail, but we will try to give you a sense of the landscape and some of the key ideas.
12.1 Key Ideas
No model can tell you whether a relationship is causal or not. Causality is inferred, not proven, based on the available evidence.
The exact same models would be used for similar data settings to answer a causal question or a purely predictive question. The primary difference is in the interpretation of the results.
Experimental design, such as a randomized control trial, is the gold standard for causal inference. But the gold standard is often not practical, and it is not without limitations even when it is.
Causal inference is often done with observational data, which is often the only option, and that’s okay.
Counterfactual thinking is at the heart of causal inference, but is useful for all modeling contexts.
Several models are typically employed to answer more causally oriented questions. These include structural equation models, graphical models, uplift modeling, and more.
Interactions are the norm, if not the rule. Causal inference generally regards a single effect. If that effect would in reality always vary depending on other features, you should question why you want to aggregate your results into a single ‘effect’, since that aggregate could be misleading.
12.1.1 Why it matters
Often we need a precise statement about the feature-target relationship, not just whether there is some relationship. For example, we might want to know whether a drug works well, or whether showing an advertisement results in a certain amount of new sales. Whether or not random assignment was used, we generally need to know whether the effect is real, the size of the effect, and often, the uncertainty in that estimate. Causal modeling is, like machine learning, more of an approach than a specific model, and that approach may involve the study design, or implementing models we’ve already seen in a different way, to answer the key question. Without more precision in our understanding, we could miss the effect, or overstate it, and make bad decisions as a result.
12.1.2 Helpful context
This section is pretty high level and we won’t go into much detail, so even just some understanding of correlation and modeling would likely be enough.
12.2 Classic Experimental Design
Many are familiar with the basic idea of an experiment, where we have a treatment group and a control group, and we want to measure the difference between the two groups. The ‘treatment’ could regard a new drug, a marketing campaign, or a new app’s feature. If we randomly assign our observational units to the two groups, say, one that gets the new app feature and the other doesn’t, we can be more confident that the two groups are essentially the same aside from the treatment, and that any difference in the outcome, for example, customer satisfaction with the app, is due to the treatment.
This is the basic idea behind a randomized control trial, or RCT. We can randomly assign the groups in a variety of ways, but you can think of it as flipping a coin, and assigning each sample to the treatment when the coin comes up on one side, and to the control when it comes up on the other. The idea is that the only difference between the two groups is the treatment, and so any difference in the outcome can be attributed to the treatment. This is visualized in Figure 12.2, where the colors represent different groups, and the groups are essentially the same aside from the treatment.
Many of those who have taken a statistics course have been exposed to the simple t-test to determine whether two groups are different. The t-test tells us whether the difference in means between the two groups is statistically significant. However, it does not tell us whether the treatment itself caused the difference, whether the effect is large, whether the effect is real, or whether the treatment is a good idea in the first place. It just tells us whether the two groups are statistically different.
Turns out, a t-test is just a linear regression model. It’s a special case of linear regression with a single independent variable, a categorical feature with two levels. The coefficient from the linear regression tells you the mean difference in the outcome between the two groups. Under the same conditions, the t-statistic from the linear regression and the t-test give identical statistical results.
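To make the equivalence concrete, here is a quick sketch in Python, with a toy setup of our own (note the equal-variance t-test, which is what matches ordinary least squares):

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(123)
control = rng.normal(loc=0.0, scale=1.0, size=50)
treated = rng.normal(loc=0.5, scale=1.0, size=50)

# Classic two-sample t-test (equal variances assumed, to match OLS)
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=True)

# The same comparison as a linear regression on a binary group feature
df = pd.DataFrame({
    'y': np.concatenate([control, treated]),
    'group': [0] * 50 + [1] * 50,
})
fit = smf.ols('y ~ group', data=df).fit()

# The group coefficient is the difference in means, and the t statistics match
print(t_stat, fit.tvalues['group'])
print(treated.mean() - control.mean(), fit.params['group'])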
Analysis of variance, or ANOVA, allows the t-test to be extended to more than two groups, and multiple features, and is also commonly employed to analyze the results of experimental design settings. But ANOVA is still just a linear regression. Even when we get into more complicated design settings such as repeated measures and mixed design, it’s still just a linear regression, we’d just be using mixed models (Section 8.3).
If linear regression didn’t suggest any notion of causality to you before, it certainly shouldn’t now either. The model is identical whether there was an experimental design with random assignment or not. The only difference is that the data was collected in a different way, and the theoretical assumptions and motivations are different. Even the statistical assumptions are the same whether you use random assignment or not, whether there are two or more groups, or whether the treatment is continuous or categorical.
Experimental design1 can give us more confidence in the causal explanation of model results, whatever model is used, and this is why we like to use it when we can. It helps us control for the unobserved factors that might otherwise be influencing the results. If we can be fairly certain the observations are essentially the same except for the treatment, then we can be more confident that the treatment is the cause of the difference, and more confident in a causal interpretation of the results. But it doesn’t change the model itself, and the results of a model don’t prove a causal relationship by themselves. Your experimental study will also be limited by the quality of the data and the population it generalizes to. Even with strong design and modeling, if care isn’t taken to assess the generalization of the results (Section 9.4), you may find they don’t hold up2.
A/B Testing
A/B testing is just marketing-speak for a project focused on comparing two groups. It implies randomized assignment, but you’d have to understand the context to know if that is actually the case.
12.3 Natural Experiments
As we noted, random assignment or a formal experiment is not always possible or practical to implement. But sometimes we get to do it anyway, or at least we can get pretty close! Sometimes the world gives us a natural experiment, where the assignment to groups is essentially random, or where there is a clear break before and after some event occurs, such that we can examine the change as we would in a pre-post design.
For example, the COVID pandemic allowed us to potentially examine vaccination effects, governmental policy effects, the effectiveness of remote work, and more. This was not a tightly controlled experiment, but it’s something we can treat very much like one, and we can compare the differences in various outcomes before and after the pandemic to see what changes took place.
12.4 Causal Inference
Reasoning about causality is a very old topic, dating back millennia philosophically, and hundreds of years more formally. Random assignment is a relatively new idea, say 150 years old, and was posited even before Wright, Fisher, and Neyman and the 20th-century rise of statistics. With statistics and random assignment, we had a way to start using models to help us reason about causal relationships. Pearl and others later provided a perspective from computer science, and things have been progressing since; we were using programming approaches to do causal inference as far back as the 1970s! Economists got into the game too (e.g., Heckman), and eventually most scientific disciplines became well acquainted with causal inference in some fashion.
Now we can use recently developed modeling approaches to help us reason about causal relationships, which can be both a blessing and a curse. Our models can be more complex, and we can use more data, which can potentially give us more confidence in our conclusions. But we can still be easily fooled by our models, as well as by ourselves. So we’ll need to be careful in how we go about things, but let’s see what some of our options are!
12.5 Models for Causal Inference
Any model can be used to answer a causal question, and which one you use will depend on the data setting and the question you are asking. The following covers a few models that might be seen in various academic and professional settings.
12.5.1 Linear regression
Yep, linear regression. The old standby is possibly the most widely used model for causal inference, historically and even today. We’ve seen linear regression as a graphical model (Figure 3.1), and in that sense it can serve as the starting point for structural equation models and related approaches, discussed next, that many consider true causal models. It can also be used as a baseline model for more complex causal modeling approaches. Linear regression can potentially tell us, for any particular feature, what that feature’s relationship with the target is, holding the other features constant. This ceteris paribus interpretation, ‘all else being equal’, already gets us into a causal mindset.
However, your standard linear model doesn’t care where the data came from or what the underlying structure should be. It only does what you ask of it, and will tell you about group differences whether they come from a randomized experiment or not. For example, if you don’t include features that have a say in how the treatment comes about (confounders), you can get a biased estimate of the effect3. Basic linear regression also cannot tell you whether X causes Y or vice versa.
As an example, let’s consider a simple linear model with a confounder. We’ll generate some synthetic data with a confounder, and fit two models, one with the confounder and one without. We’ll compare the coefficients of the feature of interest, x, in both models.
library(tibble)

# Set seed for reproducibility
set.seed(42)

# Generate synthetic data
n = 100
z = rnorm(n)           # the confounder
x = 2 * z + rnorm(n)   # the confounded feature
y = 3 * z + rnorm(n)   # the target

data = tibble(x = x, y = y, z = z)

# Fit linear models
model_without_z = lm(y ~ x, data = data)
model_with_z    = lm(y ~ x + z, data = data)

# Compare x coefficients
c(coef(model_without_z)['x'], coef(model_with_z)['x'])
x x
1.20495 0.08529
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Set seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n = 100
z = np.random.normal(size=n)             # the confounder
x = 2 * z + np.random.normal(size=n)     # the confounded feature
y = 3 * z + np.random.normal(size=n)     # the target

data = pd.DataFrame({'x': x, 'y': y, 'z': z})

# Fit linear models
model_without_z = LinearRegression().fit(data[['x']], data['y'])
model_with_z = LinearRegression().fit(data[['x', 'z']], data['y'])

# Compare x coefficients
model_without_z.coef_[0].round(3), model_with_z.coef_[0].round(3)
(1.32, -0.012)
The results show that the coefficient of x is quite different between the two models. If we don’t include the confounder, the feature’s estimated relationship with the target, whose true direct effect here is zero, merely reflects x’s correlation with the confounder, which is itself correlated with the target. Without including the confounder, we come away with the wrong conclusion about the relationship between x and y.
As we can see from this simple demo, linear regression by itself cannot save us from the difficulties of causal inference, nor can it really be considered a causal model. But it can be useful as a starting point in conjunction with other approaches.
Weighting and Sampling Methods
Common techniques for traditional statistical models used for causal inference include a variety of weighting or sampling methods. These methods are used to adjust the data so that the ‘treatment’ groups are more similar, and its effect can be more accurately estimated. Sampling methods include techniques such as stratification and matching, which focus on the selection of the sample as a means to balance treatment and control groups. Weighting methods include inverse probability weighting and propensity score weighting, which focus on adjusting the weights of the observations to make the groups more similar.
These methods are not models themselves, and potentially can be used with just about any model that attempts to estimate the effect of a treatment. An excellent overview of using such methods vs. standard regression/ML can be found on Cross Validated (https://stats.stackexchange.com/a/544958).
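As a concrete illustration of the weighting idea, here is a minimal inverse probability weighting sketch on synthetic data; the setup and numbers are our own, not from any particular study:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy setup: the probability of treatment depends on a confounder z
rng = np.random.default_rng(42)
n = 1000
z = rng.normal(size=n)
treatment = rng.binomial(1, 1 / (1 + np.exp(-z)))
y = 0.5 * treatment + z + rng.normal(size=n)  # true treatment effect = 0.5

# Step 1: estimate the propensity score P(treatment = 1 | z)
ps = (
    LogisticRegression()
    .fit(z.reshape(-1, 1), treatment)
    .predict_proba(z.reshape(-1, 1))[:, 1]
)

# Step 2: weight each observation by the inverse probability of the
# group it actually landed in
weights = np.where(treatment == 1, 1 / ps, 1 / (1 - ps))

# Step 3: the weighted difference in means estimates the treatment effect
naive = y[treatment == 1].mean() - y[treatment == 0].mean()
ipw = (
    np.average(y[treatment == 1], weights=weights[treatment == 1])
    - np.average(y[treatment == 0], weights=weights[treatment == 0])
)
print(naive, ipw)  # the naive difference is biased upward; IPW lands near 0.5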
12.5.2 Graphical and structural equation models
Graphical and Structural Equation Models (SEM) are flexible approaches to regression and classification (see Figure 12.1), and have one of the longest histories of formal statistical modeling, dating back over a century4. They are widely employed in the social sciences, and are often used to model both observed and latent variables (Section 13.9), with either serving as features or targets. They are also used to model causal relationships, to the point that historically they were even called ‘causal graphical models’ or ‘causal structural models’. SEMs are a special case of graphical models, which are common tools in computer science and non-social-science disciplines.
Formal graphical models like SEM provide a much richer set of tools for handling confounding, interaction, and indirect effects than simpler linear models. For this reason, they can be very useful for causal inference. Unfortunately for those looking for causal effects, the basic input for SEM is a correlation matrix, and the basic output is a correlation matrix. Insert your favorite modeling quote here (you know which one!). The point is that SEM, like linear regression, can no more tell you whether a relationship is causal than the linear regression or t-test could5.
Causal Language
It’s often been suggested that we reserve certain phrasing (e.g., feature X has an effect on target Y) for the causal model setting only. But the model we use can only tell us that the data is consistent with the effect we’re trying to understand, not that it actually exists. In everyday language, we often use causal wording whenever we think the relationship is or should be causal, and we think that’s okay in a modeling context too, as long as you are clear about the limits of your generalizability.
12.5.3 Counterfactual thinking
When we think about causality, we really ought to think about counterfactuals. What would have happened if I had done something different? What would have happened if I had done something sooner rather than later? What would have happened if I had done nothing at all? It’s natural to question our own actions in this way, but we can think like this in a modeling context too. In terms of our treatment effect example, we can summarize counterfactual thinking as:
The question is not whether there is a difference between A and B but whether there would still be a difference if A was B and B was A.
This is the essence of counterfactual thinking. It’s not about whether there is a difference between two groups, but whether there would still be a difference if those in one group had actually been treated differently. In this sense, we are concerned with the potential outcomes of the treatment, however defined.
Here is a more concrete example:
Roy is shown ad A, and buys the product.
Pris is shown ad B, and does not buy the product.
What are we to make of this? Which ad is better? A seems to be, but maybe Pris wouldn’t have bought the product if shown that ad either, and maybe Roy would have bought the product if shown ad B too! With counterfactual thinking, we are concerned with the potential outcomes of the treatment, which in this case is whether or not to show the ad.
Let’s say ad A is the new one, i.e., our treatment group, and B is the status quo ad, our control group. Without randomization, our real question can’t be answered by a simple test of whether means or predictions differ between the two groups, as this estimate would be biased if the groups are already different in some way to start with. The real question is, for those who saw ad A, what the difference in the outcome would have been had they not seen it.
From a prediction standpoint, we can get an initial estimate straightforwardly. We demonstrated this before in Section 5.6, but can revisit it briefly here. For those in the treatment, we can just plug in their feature values with treatment set to ad A. Then we make another prediction with treatment set to ad B6.
predict(model, X |> mutate(treatment = 'A')) -
    predict(model, X |> mutate(treatment = 'B'))
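A rough Python counterpart, assuming a hypothetical fitted sklearn-style pipeline model that knows how to encode the treatment column of a feature DataFrame X:

# Hypothetical sketch: `model` and X are assumed to already exist,
# as in the R version above
model.predict(X.assign(treatment='A')) - model.predict(X.assign(treatment='B'))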
With counterfactual thinking explicitly in mind, we can see that the difference in predictions is the difference in the potential outcomes of the treatment. This is a very simple demo to illustrate how easy it is to start getting some counterfactual results from our models. But it’s typically not quite that simple in practice, and there are many ways to get this estimate wrong as well. As in other circumstances, the data, or our assumptions about the problem can potentially lead us astray. Assuming those aspects of our modeling endeavor are in order, this is one way to get an estimate of the causal effect of the treatment.
12.5.4 Uplift modeling
The counterfactual prediction we just did provides a result that can be called the uplift or gain from the treatment, especially when compared to a baseline metric. Uplift modeling is a general term applied to models where counterfactual thinking is at the forefront, especially in a marketing context. Uplift modeling is not a specific model per se, but any model used to answer a question about the potential outcomes of a treatment. The key question is: what is the gain, or uplift, in applying a treatment versus not? Typically any statistical model can be used to answer this question, and often it is a classification model, e.g., whether Roy bought the product or not.
It is common in uplift modeling to distinguish certain types of individuals or instances, and we think it’s useful to extend this to other modeling contexts as well. In the context of our previous example they are:
Sure things: those who would buy the product whether or not shown the ad.
Lost causes: those who would not buy the product whether or not shown the ad.
Sleeping dogs: those who would buy the product if not shown the ad, but not if they are shown the ad. Also referred to as the ‘Do not disturb’ group!
Persuadables: those who would buy the product if shown the ad, but not otherwise.
We can generalize beyond the marketing context to just think about response to any treatment we might be interested in. It’s worthwhile to think about which aspects of your data could correspond to these groups. One of the additional goals in uplift modeling is to identify persuadables for additional treatment efforts, and to avoid wasting money on the lost causes. But to get there, we have to think causally first!
Uplift Modeling in R and Python
There are more widely used tools for uplift modeling and meta-learners in Python than in R, but there are some options in R as well. In Python you can check out causalml and scikit-uplift for some nice tutorials and documentation.
12.5.5 Meta-Learning
Meta-learners are used in machine learning contexts to assess potentially causal relationships between some treatment and an outcome. The core model can be of any kind you might want to use, but extra steps are taken to assess the causal relationship. The most common types of meta-learners are:
S-learner - single model for both groups; predict the (counterfactual) difference as when all observations are treated vs when all are not, similar to our previous code demo.
T-learner - two models, one for each of the control and treatment groups; predict the values as if all observations are ‘treated’ versus when all are ‘control’ using both models, and take the difference.
X-learner - a more complicated modification to the T-learner also using a multi-step approach.
Some additional variants of these models exist, and they can be used in a variety of settings, not just uplift modeling. The key idea is to use the model to predict the potential outcomes of the treatment, and then to take the difference between the two predictions as the causal effect.
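As an illustration, here is a minimal T-learner sketch; the synthetic data and model choices are our own, not a prescribed recipe:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data with a constant true uplift of 0.5
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
treatment = rng.binomial(1, 0.5, size=n)
y = X[:, 0] + 0.5 * treatment + rng.normal(size=n)

# T-learner: fit one model per group
model_treated = RandomForestRegressor(random_state=0).fit(X[treatment == 1], y[treatment == 1])
model_control = RandomForestRegressor(random_state=0).fit(X[treatment == 0], y[treatment == 0])

# Predict both potential outcomes for every observation; the difference
# is the estimated uplift (treatment effect) per observation
uplift = model_treated.predict(X) - model_control.predict(X)
print(uplift.mean())  # should land near the true value of 0.5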
Meta-Learners vs. Meta-Analysis
Meta-learners are not to be confused with meta-analysis, which is also related to understanding causal effects. Meta-analysis attempts to combine the results of multiple studies to get a better estimate of the true effect, where the studies are typically conducted by different researchers and in different settings. The term meta-learning has also been used to refer to what is more commonly called ensemble learning, the approach used in random forests and boosting. It is also used by other people who don’t bother to look things up before naming their technical terms.
12.5.6 Other models used for causal inference
Note that many models fall under the umbrella of causal inference, several of which are discipline specific, but which really are only special applications of models we’ve already seen. A few you might come across:
G-computation, doubly robust estimation, targeted maximum likelihood estimation7
Marginal structural models
Instrumental variables and two-stage least squares
Propensity score matching/weighting
Regression discontinuity design
Mediation/moderation analysis
Meta-analysis
Bayesian networks
In general, any modeling technique might be employed as part of a causal modeling approach. To actually make causal statements, you’ll generally need to ensure that the assumptions for those claims are tenable.
12.6 Prediction and Explanation Revisited
We introduced the idea of prediction and explanation in the context of linear models in Section 3.4.3, and it’s worth revisiting here. One attribute of a causal model is an intense focus on the explanatory power of the model. We want to demonstrate that there is a relationship between the target and, usually, a single feature of interest, and we want to know the precise nature of this relationship as much as possible. Even if we use complex models, e.g., as with meta-learners, the endeavor is to explain the specifics.
Let’s say that we used some particular causal modeling approach to explain a feature-target relationship in a classification setting. We have 10,000 observations, and the baseline rate of the target is ~50%. We have a model that predicts the target y based on the feature of interest x, and we may have used something like propensity score weighting or some other technique to help control for confounding.
The coefficient, though small with an odds ratio of 1.05, is statistically significant (take our word for it), and we can see a slight positive relationship. Under certain settings, such as this one, where we are interested in causal effects and have controlled for various other factors to obtain the result, we might be satisfied with interpreting this positive relationship.
But if we are interested in predictive performance, we would be disappointed with this model. It predicts the target at about the same rate as guessing, even on the data it was fit on, and does even worse with new data. The effect shown is also quite small by typical standards: it would take a standard deviation change in the feature to get a ~1% change in the probability of the target (x is standardized).
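To verify that claim with a little arithmetic of our own:

import numpy as np

# Start at the baseline probability of ~.5 and apply the odds ratio of
# 1.05 for a one SD increase in the standardized feature
baseline_p = 0.5
log_odds = np.log(baseline_p / (1 - baseline_p)) + np.log(1.05)
new_p = 1 / (1 + np.exp(-log_odds))
print(new_p - baseline_p)  # ≈ 0.012, i.e., about a 1% change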
If we are concerned solely with explanation, we would first want to ask ourselves whether we can trust our result given the data, model, and various issues that went into producing it. If so, we can then consider whether the effect is large enough to be of interest, and whether the result is useful in making decisions8. It may very well be: maybe the target concerns the rate of survival, where any increase is worthwhile. Or perhaps the data circumstances demand such an interpretation, because it is extremely costly to obtain more. For more exploratory efforts, however, this sort of result would likely not be enough to come to any strong conclusion, even if explanation is the only goal.
As another example, consider the world happiness data we’ve used in previous demonstrations. We want to explain the association of country-level characteristics with the population’s happiness. We likely aren’t as interested in predicting next year’s happiness score as in what attributes are correlated with a happy populace in general. In this election year (2024) in the U.S., we’d be interested in specific factors related to presidential elections, for which there are relatively few data points. In these cases, explanation is the focus, and we may not even need a model at all to come to our conclusions.
So we can understand that in some settings we may be more interested in understanding the underlying mechanisms of the data, as with these examples, and in others we may be more interested in predictive performance, as in our demonstrations of machine learning. However, the distinction between prediction and explanation is ultimately a bit problematic, not least because we often want to do both.
Although it’s often implied as such, prediction is not just what we do with new data. It is the very means by which we get any explanation of effects via coefficients, marginal effects, visualizations, and other model results. Additionally, where the focus is on predictive performance, if we can’t explain the results we get, we will typically feel dissatisfied, and may still question how well the model is actually doing.
Here are some ways we might think about different modeling contexts:
Descriptive Analysis: A description of data with no modeling focus. We’ll use descriptive statistics and visualizations to understand the data. An end product may be an infographic or a report. Even here we may still use models to aid visualizations or otherwise to help us understand the data better.
Exploratory Modeling: Using models for exploration. Focus should be on both prediction and explanation. The former can help inform the strength of the results for future exploration.
Causal Modeling: Using models to understand causal effects. We focus on explanation, and prediction on the current data. We may very well be interested in predictive performance also, and often are in industry.
Generalization: When our goal is generalizing to unseen data, the focus is always on predictive performance. This does not mean we can’t use the model to understand the data though, and explanation could possibly be as important.
Depending on the context, we may be more interested in explanation or predictive performance, but in practice we often, if not usually, want both. It is crucial to remind yourself why you are interested in the problem, what a model is capable of telling you about it, and to be clear about what you want to get out of the result.
12.7 Wrapping Up
We’ve been pretty loose in our presentation here, and glossed over many details with causal modeling. Our main goal is to give you some idea of the domain, but more so the models used and things to think about when you want to answer a causal question with your data.
Models used in statistical analysis and machine learning are not causal models, but when we take a causal model from the realm of ideas and apply it to the real world, it becomes a statistical/ML model with more assumptions, and with additional steps taken to address those assumptions9. These assumptions are required in order to make stronger causal statements, but neither the assumptions, data, nor model can prove that the underlying theory is causally correct. Things like random assignment, sampling, a complex model, and good data can possibly help the situation, but they can’t save you from a fundamental misunderstanding of the problem, or from data that may still be consistent with that misunderstanding. Nothing about employing a causal model inherently makes better predictions either.
Causal modeling is hard, and most of the difficulty lies outside of the realm of models and data. The model implemented reflects the causal theory, which can be a correct or incorrect idea about how the world works. In the end, the main thing is that when we want to make causal statements, we’ll make do with what data we have, and be careful that we rule out some of the other obvious explanations and issues. The better we can control the setting, or the better we can do things from a modeling standpoint, the more confident we can be in making causal claims. Causal modeling is an exercise in reasoning, which makes it such an interesting endeavor.
12.7.1 The common thread
Engaging in causal modeling may not require you to learn any new models, but you will typically have to do more to be able to make causal statements. The key is to think about the problem in a different way, to be more careful about the assumptions you are making, and to do more to ensure that your data and model are consistent with those assumptions.
12.7.2 Choose your own adventure
From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we’ve mentioned here, and see how they are used in practice via the additional resources.
12.7.3 Additional resources
We have only scratched the surface here, and there is a lot more to learn. A question to ponder as you explore: you’ll find mention of problematic covariates such as colliders or confounders. Can you think of a way to determine whether something is a collider or a confounder that would not involve a statistical approach or model?
Here are some resources to get you started:
Barrett, Malcolm, Lucy D’Agostino McGowan, and Travis Gerke. 2024. Causal Inference in R. https://www.r-causal.org/.
Chernozhukov, Victor, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. 2024. “Applied Causal Inference Powered by ML and AI.” arXiv. http://arxiv.org/abs/2403.02467.
Hernán, Miguel A. 2018. “The C-Word: Scientific Euphemisms Do Not Improve Causal Inference from Observational Data.” American Journal of Public Health 108 (5): 616–19. https://doi.org/10.2105/AJPH.2018.304337.
Note that experimental design is not just any setting that uses random assignment, but more generally concerns how we introduce control into the setting and sampling.↩︎
Many experimental design settings involve very small samples due to the cost of implementing the treatment and other reasons. This often limits exploration of more complex relationships (e.g., interactions), and it is relatively rare to see any assessment of performance generalization. It would probably worry many to know how many experimental results are based on p-values with small data; this is part of the problem seen in the replication crisis in science.↩︎
A reminder that a conclusion of ‘no effect’ is also a causal statement, and can be just as biased as any other statement. Also, you can come to the same practical conclusion with a biased estimate as with an unbiased one.↩︎
Wright is credited with coming up with what would be called path analysis in the 1920s, which is a precursor to and part of SEM.↩︎
Your authors have to admit some bias here. We’ve spent a lot of our past dealing with SEMs, and almost every application we saw had too little data, too little generalization, and was grossly overfit. Many SEM programs even added multiple ways to overfit the data further, and it is difficult to trust the results reported in many papers that used them. But that’s not the fault of SEM in general: it can be a useful tool when used correctly, and it can help answer causal questions, but it is not a magic bullet, and using it doesn’t make anyone look fancier.↩︎
This is basically the S-Learner approach to meta-learning, which we’ll discuss in a bit. It is generally considered too weak an approach on its own.↩︎
The G-computation approach and S-learners are essentially the same approach, but came about from different domain contexts.↩︎
This is a contrived example, but it is definitely something that you might see in the wild. The relationship is weak, and though statistically significant, the model can’t predict the target well at all. The statistical power is actually decent in this case, roughly 70%, but this is mainly because the sample size is so large and it is a very simple model setting. This is a common issue in many academic fields, and it’s why we always need to be careful about how we interpret our models. In practice, we would generally need to consider other factors, such as the cost of a false positive or false negative, or the cost of the data and running the model itself, to determine if the model is worth using.↩︎
Gentle reminder that making an assumption does not mean the assumption is correct, or even provable.↩︎