10  Causal Modeling

All those causal effects will be lost in time, like tears in rain… without adequate counterfactual considerations.
Roy Batty (paraphrased)
Figure 10.1: A causal DAG

Causal inference is a very important topic in machine learning and statistics, and it is also a very difficult one to understand well, or consistently, because not everyone agrees on how to define a cause in the first place. Our focus here is merely practical: we just want to show how it can play out in the modeling landscape from a very high-level overview. This is such a rabbit hole that we won't be able to go into much detail, but we will try to give you a sense of the landscape and some of the key ideas.

10.1 Key ideas

  • No model can tell you whether a relationship is causal or not. Causality is inferred, not proven, based on the available evidence.
  • The exact same models would be used for similar data settings to answer a causal question or a purely predictive question. The primary difference is in the interpretation of the results.
  • Experimental design, such as randomized control trials, is the gold standard for causal inference. But the gold standard is often not practical, and not without its limitations even when it is.
  • Causal inference is often done with observational data, which is often the only option, and that’s okay.
  • Counterfactual thinking is at the heart of causal inference, but is useful for all modeling contexts.
  • Several models exist which are typically employed to answer a more causal-oriented question. These include structural equation models, graphical models, uplift modeling, and more.
  • Interactions are the norm in reality. Causal inference generally regards a single effect. If in the typical setting that effect would always vary depending on other features, you should question why you want to aggregate your results to a single ‘effect’, since that aggregate could be misleading.

10.1.1 Why it matters

Often we need a precise statement about the feature-target relationship, not just whether there is some relationship. For example, we might want to know whether a drug works well, or whether showing an advertisement results in a certain amount of new sales. Whether or not random assignment was used, we generally need to know whether the effect is real, how large it is, and often, the uncertainty in that estimate. Causal modeling is, like machine learning, more of an approach than a specific model, and that approach may involve study design, or implementing models we’ve already seen in a different way, to answer the key question. Without more precision in our understanding, we could miss the effect, or overstate it, and make bad decisions as a result.

10.1.2 Good to know

This section is pretty high level, and we are not going to go into much detail here, so even just a basic understanding of correlation and modeling would likely be enough.

10.2 Classic Experimental Design

Figure 10.2: Random Assignment

Many are familiar with the basic idea of an experiment, where we have a treatment group and a control group, and we want to measure the difference between the two groups. The ‘treatment’ could be a new drug, a marketing campaign, or a new app feature. If we randomly assign our observational units to the two groups, say, one that gets the new app feature and one that doesn’t, we can be more confident that the two groups are essentially the same aside from the treatment, and that any difference in the outcome, for example, customer satisfaction with the app, is due to the treatment.

This is the basic idea behind a randomized control trial, or RCT. We can randomly assign the groups in a variety of ways, but you can think of it as flipping a coin, and assigning each sample to the treatment when the coin comes up on one side, and to the control when it comes up on the other. The idea is that the only difference between the two groups is the treatment, and so any difference in the outcome can be attributed to the treatment. This is visualized in Figure 10.2, where the colors represent different groups, and the groups are essentially the same aside from the treatment.
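As a minimal sketch of such a coin-flip assignment (a purely illustrative setup):

import numpy as np

rng = np.random.default_rng(42)
n_units = 10

# 'Flip a coin' for each observational unit:
# 1 = treatment (e.g., sees the new app feature), 0 = control
assignment = rng.integers(0, 2, size=n_units)
print(assignment)  # e.g., [1 0 1 0 0 ...]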

Many of those who have taken a statistics course have been exposed to the simple t-test to determine whether two groups are different. The t-test tells us whether the difference in means between the two groups is statistically significant. However, it definitely does not tell us whether the treatment itself caused the difference, whether the effect is large, whether the effect is ‘real’, or even whether the treatment was a good idea to do in the first place. It just tells us whether the two groups are statistically different.

Turns out, a t-test is just a linear regression model. It’s a special case of linear regression where there is only one independent variable, and it is a categorical variable with two levels. The coefficient from the linear regression would tell you the mean difference of the outcome between the two groups. Under the same conditions, the t-statistic from the linear regression and from the t-test would be identical.
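As a quick sketch of this equivalence, with simulated data just for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'group': np.repeat(['control', 'treatment'], 50),
    'outcome': np.concatenate([rng.normal(0, 1, 50), rng.normal(0.5, 1, 50)])
})

# The standard t-test (equal variances assumed, to match OLS)
t_res = stats.ttest_ind(
    df.loc[df.group == 'treatment', 'outcome'],
    df.loc[df.group == 'control', 'outcome'],
    equal_var=True,
)

# The same comparison as a linear regression
lm = smf.ols('outcome ~ group', data=df).fit()

print(t_res.statistic)                    # t-statistic from the t-test
print(lm.tvalues['group[T.treatment]'])   # identical t-statistic from the regression
print(lm.params['group[T.treatment]'])    # the coefficient is the mean difference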

Analysis of variance, or ANOVA, allows the t-test to be extended to more than two groups, and multiple features, and is also commonly employed to analyze the results of experimental design settings. But ANOVA is still just a linear regression. Even when we get into more complicated design settings such as repeated measures and mixed design, it’s still just a linear regression; we’d just be using mixed models (Section 5.3).
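Continuing the sketch above, the ANOVA table can be computed directly from that same regression fit, since ANOVA is just a different summary of the model:

import statsmodels.api as sm

# `lm` is the fitted regression from the t-test example above
print(sm.stats.anova_lm(lm))  # with two groups, F here equals the squared t-statistic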

If using a linear regression didn’t suggest any notion of causality to you before, it certainly shouldn’t now either. The model is identical whether there was an experimental design with random assignment or not. The only difference is that the data was collected in a different way, and the theoretical assumptions and motivations are different. Even the statistical assumptions are the same whether you use random assignment or not, whether there are one or more groups, or whether the treatment is continuous or categorical.

Experimental design1 can give us more confidence in the causal explanation of model results, whatever model is used, and this is why we like to use it when we can. It helps us control for the unobserved factors that might otherwise be influencing the results. If we can be fairly certain the observations are essentially the same except for the treatment, then we can be more confident that the treatment is the cause of the difference, and we can be more confident in the causal interpretation of the results. But it doesn’t change the model itself, and the results of a model don’t prove a causal relationship by themselves. Your experimental study will also be limited by the quality of the data, and the population it generalizes to. Even with strong design and modeling, if care isn’t taken in the modeling process to assess the generalization of the results (Section 6.4), you may find they don’t hold up2.

A/B testing is just marketing-speak for a project focused on comparing two groups. It implies randomized assignment, but you’d have to understand the context to know if that is actually the case.

10.3 Natural Experiments

Figure 10.3: Covid Vaccinations and Deaths in the US

As we noted, random assignment or a formal experiment is not always possible or practical to implement. But sometimes we get to do it anyway, or at least we can get pretty close! Sometimes, the world gives us a natural experiment, where the assignment to the groups is essentially random, or where there is a clear break before and after some event occurs, such that we can examine the change as we would in a pre-post design.

For example, the covid pandemic allowed us to potentially examine vaccination effects, governmental policy effects, the effectiveness of remote work, and more. This was not a tightly controlled experiment, but it’s something we can treat very similarly to an experiment, and we can compare the differences in various outcomes before and after the pandemic to see what changes took place.

10.4 Causal Inference

Reasoning about causality is a very old topic, philosophically dating back millennia, and more formally, hundreds of years. Random assignment is a relatively new idea, roughly 150 years old, posited even before Wright, Fisher, and Neyman and the 20th century rise of statistics. But with statistics and random assignment, we had a way to start using models to help us reason about causal relationships. Pearl and others later provided a perspective from computer science, and things have been progressing along; we were even using programming approaches to do causal inference back in the 1970s! Economists got into the game too (e.g., Heckman), and eventually most scientific academic disciplines became well acquainted with causal inference in some fashion.

Now we can use recently developed modeling approaches to help us reason about causal relationships, which can be both a blessing and a curse. Our models can be more complex, and we can use more data, which can potentially give us more confidence in our conclusions. But we can still be easily fooled by our models, as well as by ourselves. So we’ll need to be careful in how we go about things, but let’s see what some of our options are!

10.5 Models for Causal Inference

Any model can be used to answer a causal question, and which one you use will depend on the data setting and the question you are asking. The following covers a few models that might be seen in various academic and professional settings.

10.5.1 Linear Regression

Yep, linear regression. The old standby is possibly the most widely used model for causal inference, historically speaking and even today. We’ve seen linear regression as a graphical model (Figure 1.2), and in that sense, it can serve as the starting point for structural equation models and related approaches, discussed next, that many consider to be true causal models. It can also be used as a baseline model for other more complex causal model approaches. Linear regression can potentially tell us, for any particular feature, what that feature’s relationship with the target is, holding the other features constant. This ceteris paribus (‘all else being equal’) interpretation already gets us into a causal mindset.

However, your standard linear model doesn’t care where the data came from or what the underlying structure should be. It only does what you ask of it, and will tell you about group differences whether they come from a randomized experiment or not. For example, if you don’t include features that have a say in how the treatment comes about (confounders), you could get a biased estimate of the effect3. Basic linear regression also cannot tell you whether X is the cause of Y or vice versa. As such, linear regression by itself cannot save us from the difficulties of causal inference, nor really be considered a causal model. But it can be useful as a starting point in conjunction with other approaches.
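To see the confounding issue in action, here is a small simulation (a hypothetical setup where z drives both the treatment and the outcome, and the true treatment effect is zero):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                            # confounder
treat = (z + rng.normal(size=n) > 0).astype(int)  # treatment depends on z
y = z + rng.normal(size=n)                        # outcome depends on z, not the treatment

df = pd.DataFrame({'y': y, 'treat': treat, 'z': z})

# Omitting the confounder suggests a treatment effect that isn't there
print(smf.ols('y ~ treat', data=df).fit().params['treat'])      # biased, well above 0
# Including it recovers the true (null) effect
print(smf.ols('y ~ treat + z', data=df).fit().params['treat'])  # near 0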

Common techniques for traditional statistical models used for causal inference include a variety of weighting or sampling methods. These methods are used to adjust the data so that the ‘treatment’ groups are more similar, and its effect can be more accurately estimated. Sampling methods include techniques such as stratification and matching, which focus on the selection of the sample as a means to balance treatment and control groups. Weighting methods include inverse probability weighting and propensity score weighting, which focus on adjusting the weights of the observations to make the groups more similar.

These methods are not models themselves, and can potentially be used with just about any model that attempts to estimate the effect of a treatment.
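For instance, here is a minimal sketch of inverse probability weighting, under a simulated setup where a single confounder drives treatment assignment:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(123)
n = 2000
x1 = rng.normal(size=n)                              # confounder
treatment = rng.binomial(1, 1 / (1 + np.exp(-x1)))   # treatment more likely when x1 is high
outcome = 0.5 * treatment + x1 + rng.normal(size=n)  # true treatment effect is 0.5
df = pd.DataFrame({'x1': x1, 'treatment': treatment, 'outcome': outcome})

# Step 1: propensity scores from a logistic regression
ps = smf.logit('treatment ~ x1', data=df).fit(disp=0).predict(df).to_numpy()

# Step 2: weight each observation by the inverse probability
# of the treatment it actually received
w = np.where(treatment == 1, 1 / ps, 1 / (1 - ps))

# Step 3: a weighted difference in means estimates the average treatment effect
treated, control = treatment == 1, treatment == 0
ate = (np.average(outcome[treated], weights=w[treated])
       - np.average(outcome[control], weights=w[control]))
print(ate)  # should be near the true effect of 0.5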

10.5.2 Graphical Models & Structural Equation Models

Graphical and Structural Equation Models (SEM) are flexible approaches to regression and classification (see Figure 10.1), and have one of the longest histories of formal statistical modeling, dating back over a century4. They are widely employed in the social sciences, and are often used to model both observed and latent variables (Section 9.9), with either serving as features or targets. They are also used to model causal relationships, to the point that historically they were even called ‘causal graphical models’ or ‘causal structural models’. SEMs are a special case of graphical models, which are common tools in computer science and non-social science disciplines.

Formal graphical models like SEM provide a much richer set of tools for controlling various confounding, interaction, and indirect effects than simpler linear models. For this reason, they can be very useful for causal inference. Unfortunately for those looking for causal effects, the basic input for SEM is a correlation matrix, and the basic output is a correlation matrix. Insert your favorite modeling quote here (you know which one!). The point is that SEM can no more tell you whether a relationship is causal than a linear regression or t-test could5.

It’s often been suggested that we reserve certain phrasing (e.g. feature X has an effect on target Y) for the causal model setting only. But the model we use can only tell us that the data is consistent with the effect we’re trying to understand, not that it actually exists. In everyday language, we often use causal language whenever we think the relationship is or should be causal, and we think that’s okay in a modeling context too, as long as you are clear about the limits of your generalizability.

10.5.3 Counterfactual Thinking

When we think about causality, we really ought to think about counterfactuals. What would have happened if I had done something different? What would have happened if I had done something sooner rather than later? What would have happened if I had done nothing at all? It’s natural to question our own actions in this way, but we can think like this in a modeling context too. In terms of our treatment effect example, we can summarize counterfactual thinking as:

the question is not whether there is a difference between A and B but whether there would still be a difference if A was B and B was A.

This is the essence of counterfactual thinking. It’s not about whether there is a difference between two groups, but whether there would still be a difference if those in one group had actually been treated differently. In this sense, we are concerned with the potential outcomes of the treatment, however defined.

Here is a more concrete example:

  • Roy is shown ad A, and buys the product.
  • Pris is shown ad B, and does not buy the product.

What are we to make of this? Which ad is better? A seems to be, but maybe Pris wouldn’t have bought the product if shown that ad either, and maybe Roy would have bought the product if shown ad B too! With counterfactual thinking, we are concerned with the potential outcomes of the treatment, which in this case is whether or not to show the ad.

Let’s say ad A is the new one, i.e., our treatment group, and B is the status quo ad, our control group. Without randomization, our real question can’t be answered by a simple test of whether means or predictions differ between the two groups, as this estimate would be biased if the groups are already different in some way to start with. The real question is, for those who saw ad A, what the difference in the outcome would have been had they not seen it.

From a prediction standpoint, we can get an initial estimate straightforwardly. We demonstrated this before in Section 2.3.4, but we can revisit it briefly here. We can plug in the feature values with treatment set to ad A, and then make a second prediction with treatment set to ad B6.

# Python
model.predict(X.assign(treatment = 'A')) - 
    model.predict(X.assign(treatment = 'B'))

# R (|> with dplyr's mutate)
predict(model, X |> mutate(treatment = 'A')) - 
    predict(model, X |> mutate(treatment = 'B'))

With counterfactual thinking explicitly in mind, we can see that the difference in predictions is the difference in the potential outcomes of the treatment. This is a very simple demo to illustrate how easy it is to start getting counterfactual results from our models. But it’s typically not quite that simple in practice, and there are many ways to get this estimate wrong as well. As in other circumstances, the data, or our assumptions about the problem, can potentially lead us astray. Assuming those aspects of our modeling endeavor are in order, this is one way to get an estimate of the causal effect of the treatment.

10.5.4 Uplift Modeling

The counterfactual prediction we just did provides a result that can be called the uplift or gain from the treatment, especially when compared to a baseline metric. Uplift modeling is a general term applied to models where counterfactual thinking is at the forefront, especially in a marketing context. Uplift modeling is not a specific model per se, but any model that is used to answer a question about the potential outcomes of a treatment. The key question is: what is the gain, or uplift, in applying a treatment vs. not? Typically any statistical model can be used to answer this question, and often it is a classification model, e.g., whether Roy bought the product or not.

It is common in uplift modeling to distinguish certain types of individuals or instances, and we think it’s useful to extend this to other modeling contexts as well. In the context of our previous example they are:

  • Sure things: those who would buy the product whether or not shown the ad.
  • Lost causes: those who would not buy the product whether or not shown the ad.
  • Sleeping dogs: those who would buy the product if not shown the ad, but not if they are shown the ad. Also referred to as the ‘Do not disturb’ group!
  • Persuadables: those who would buy the product if shown the ad, but not if not shown the ad.

We can generalize beyond the marketing context to just think about response to any treatment we might be interested in. It’s worthwhile to think about which aspects of your data could correspond to these groups. One of the additional goals in uplift modeling is to identify persuadables for additional treatment efforts, and to avoid wasting money on the lost causes. But to get there, we have to think causally first!
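To make this concrete, here is a small simulated sketch of computing uplift scores to find those persuadables, using an S-learner-style classifier (all data and names here are hypothetical):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated data: one customer feature, a randomized ad assignment, and a
# purchase outcome where the ad only helps some customers
rng = np.random.default_rng(7)
n = 2000
x = rng.normal(size=n)
ad = rng.binomial(1, 0.5, size=n)          # 1 = shown ad A, 0 = ad B
p_buy = 1 / (1 + np.exp(-(x + ad * x)))    # the ad helps only when x > 0
buy = rng.binomial(1, p_buy)

# A single model that includes the treatment and its interaction as features
X = pd.DataFrame({'x': x, 'ad': ad, 'ad_x': ad * x})
model = LogisticRegression().fit(X, buy)

# Uplift score: predicted purchase probability if shown ad A vs. ad B
X_a = X.assign(ad=1, ad_x=x)
X_b = X.assign(ad=0, ad_x=0)
uplift = model.predict_proba(X_a)[:, 1] - model.predict_proba(X_b)[:, 1]

# High positive scores suggest persuadables; strongly negative ones suggest
# 'sleeping dogs' we may want to leave alone
top_candidates = np.argsort(uplift)[::-1][:100]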

There are more widely used tools for uplift modeling and meta-learners in Python than in R, but there are some options in R as well. In Python you can check out causalml and scikit-uplift for some nice tutorials and documentation.

10.5.5 Meta-Learning

Meta-learners are used in machine learning contexts to assess potentially causal relationships between some treatment and outcome. The core model can be of any kind you might want to use, but extra steps are taken to assess the causal relationship. The most common types of meta-learners are:

  • S-learner - a single model for both groups, with the treatment as a feature; predict the (counterfactual) difference as when all observations are treated vs. when none are, similar to our code demo above.
  • T-learner - two models, one for each treatment group; predict the outcome for all observations with each model, and take the difference.
  • X-learner - a more complicated, multi-step modification of the T-learner.

Some additional variants of these models exist, and they can be used in a variety of settings, not just uplift modeling. The key idea is to use the model to predict the potential outcomes of the treatment, and then to take the difference between the two predictions as the causal effect.
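As a minimal sketch of the T-learner idea (simulated data; we use scikit-learn's random forest here, but any model could serve):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def t_learner_effects(X, y, t):
    # One model per treatment group
    m1 = RandomForestRegressor(random_state=0).fit(X[t == 1], y[t == 1])
    m0 = RandomForestRegressor(random_state=0).fit(X[t == 0], y[t == 0])
    # Predict both potential outcomes for every observation; the difference
    # is the estimated individual-level effect
    return m1.predict(X) - m0.predict(X)

# Simulated data where the true treatment effect is 2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
t = rng.binomial(1, 0.5, size=500)
y = X[:, 0] + 2 * t + rng.normal(size=500)

effects = t_learner_effects(X, y, t)
print(effects.mean())  # estimated average treatment effect, near 2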

Meta-learners are not to be confused with meta-analysis, which is also related to understanding causal effects. Meta-analysis attempts to combine the results of multiple studies to get a better estimate of the true effect. The studies are typically conducted by different researchers and in different settings. The term meta-learning has also been used to refer to what is more commonly called ensemble learning, the approach used in random forests and boosting. It is also used by other people that don’t bother to look things up before naming their technical terms.

10.5.6 Other Models Used for Causal Inference

Note that there are many models that fall under the umbrella of causal inference, and several that are discipline-specific but are really only special applications of ones we’ve already seen. A few you might come across:

  • Marginal structural models
  • Instrumental variables and two-stage least squares
  • Propensity score matching/weighting
  • Regression discontinuity design
  • Mediation/moderation analysis
  • Meta-analysis
  • Bayesian networks

In general, any modeling technique might be employed as part of a causal modeling approach. To actually make causal statements, you’ll generally need to ensure that the assumptions for those claims are tenable.

10.6 Wrapping Up

We’ve been pretty loose in our presentation here, and glossed over many details with causal modeling. Our main goal is to give you some idea of the domain, but more so the models used and things to think about when you want to answer a causal question with your data.

Models used in statistical analysis and machine learning are not causal models, but when we take a causal model from the realm of ideas and apply it to the real world, a causal model becomes a statistical/ML model with more assumptions, and with additional steps taken to address those assumptions7. These assumptions are required in order to make stronger causal statements, but neither the assumptions, data, nor model can prove that the underlying theory is causally correct. Things like random assignment, sampling, a complex model and good data can possibly help the situation, but they can’t save you from a fundamental misunderstanding of the problem, or data that may still be consistent with that misunderstanding. Nothing about employing a causal model inherently makes better predictions either.

Causal modeling is hard, and most of the difficulty lies outside of the realm of models and data. The model implemented reflects the causal theory, which can be a correct or incorrect idea about how the world works. In the end, the main thing is that when we want to make causal statements, we’ll make do with what data we have, and be careful that we rule out some of the other obvious explanations and issues. The better we can control the setting, or the better we can do things from a modeling standpoint, the more confident we can be in making causal claims. Causal modeling is an exercise in reasoning, which makes it such an interesting endeavor.

10.6.1 The Common Thread

Engaging in causal modeling may not even require you to learn any new models, but you will typically have to do more to be able to make causal statements. The key is to think about the problem in a different way, and to be more careful about the assumptions you are making. You may need to do more to ensure that your data and model are consistent with the assumptions you are making.

10.6.2 Choose Your Own Adventure

From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we’ve mentioned here, and see how they are used in practice via the additional resources below.

10.6.3 Additional Resources

We have only scratched the surface here, and there is a lot more to learn. Here are some resources to get you started:

10.7 Exercise

If you look into causal modeling, you’ll find mention of problematic covariates such as colliders or confounders. Can you think of a way to determine if something is a collider or confounder that would not involve a statistical approach or model?


  1. Note that experimental design is not just any setting that uses random assignment, but more generally how we introduce control in the sample settings.↩︎

  2. Many experimental design settings involve very small samples due to the cost of the treatment implementation and other reasons. This often limits exploration of more complex relationships (e.g. interactions), and it is relatively rare to see any assessment of performance generalization. It would probably worry many to know how many experimental results are based on p-values with small data, and this is part of the problem seen with the replication crisis in science.↩︎

  3. A reminder that a conclusion of ‘no effect’ is also a causal statement, and can be just as biased as any other statement. Also, you can come to the same practical conclusion with a biased estimate as with an unbiased one.↩︎

  4. Wright is credited with coming up with what would be called path analysis in the 1920s, which is a precursor to and part of SEM.↩︎

  5. Your authors have to admit some bias here. We’ve spent a lot of our past dealing with SEMs, and almost every application we saw had too little data, showed too little generalization, and was grossly overfit. Many SEM programs added multiple ways to overfit the data even further, and it is difficult to trust the results reported in many papers that used them. But that’s not the fault of SEM in general; it can be a useful tool when used correctly, and it can help answer causal questions, but it is not a magic bullet, and it doesn’t make anyone look fancier by using it.↩︎

  6. This is basically the S-Learner approach to meta-learning, which we’ll discuss in a bit. It is generally considered too weak an approach in practice.↩︎

  7. Gentle reminder that making an assumption does not mean the assumption is correct, or even provable.↩︎