The following is a summary of Pearl’s 2014 and 2013 technical reports on some modeling situations that lead to surprising results that are initially at odds with our intuition.
Two statisticians are interested in determining whether males and females differ in weight gain over the course of a semester. In Lord’s original depiction there is an implication of some diet treatment, but all that can be assumed is that everyone under consideration received the ‘treatment’ if there was one. The two statisticians take different approaches to examining the data and come to different conclusions.
The following graph is from Pearl 2014. The ellipses represent the scatter plots for boys and girls. The diagonal 45-degree line represents no change from time 1 to time 2. The centers of both ellipses lie on this line, and thus the mean changes for boys and girls are identical and zero. The density plot in the lower left depicts the distribution of change scores centered on this zero estimate.
Statistician 1 focuses on change scores, while statistician 2 uses an ANCOVA to examine sex differences at time 2 while adjusting for initial weight.
The model can be depicted as a directed acyclic graph as follows.
In the above we can define the following effects of sex on final weight, or on change through effects on initial and final weight.
Gain is completely determined as the difference between final weight and initial weight, so the direct effects of initial and final weight on it are not estimated but fixed to -1 and +1 respectively. To calculate the total effect of sex on weight gain, we must sum all the paths from sex to gain: sex → final → gain, sex → initial → gain, and sex → initial → final → gain.
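Writing the tracing rule out makes the sum explicit. The path labels below are assumed here for illustration: \(a\) for sex → initial, \(b\) for sex → final, and \(c\) for initial → final, with the paths into gain fixed at -1 (from initial) and +1 (from final).

```latex
\begin{aligned}
\text{TE}_{\text{sex} \rightarrow \text{gain}}
  &= \underbrace{b \cdot (+1)}_{\text{sex} \rightarrow \text{final} \rightarrow \text{gain}}
   + \underbrace{a \cdot (-1)}_{\text{sex} \rightarrow \text{initial} \rightarrow \text{gain}}
   + \underbrace{a \cdot c \cdot (+1)}_{\text{sex} \rightarrow \text{initial} \rightarrow \text{final} \rightarrow \text{gain}} \\
  &= b - a\,(1 - c)
\end{aligned}
```

Under this labeling the total effect vanishes whenever \(b = a(1 - c)\), even though the direct effect \(b\) of sex on final weight is not zero.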
In summary, the two statisticians are focused on different effects from the same model: statistician 1 on the total effect of sex on gain, statistician 2 on the direct effect of sex on final weight.
We can get an explicit sense of the results by means of a hands-on example. In the following we have simulated data that will reproduce the situation described thus far. Parameters were chosen for visual and statistical effect. One thing that has not been noted about the example is that it likely would not occur in practice: the total effect would likely be positive, as there would be strong sex differences in initial weight and a strong positive correlation between initial and final weight, and these would not be arranged in a way that nullifies the effect. The other issue not addressed is that the entire focus on whether an effect exists hinges on p-values, and with large enough data and such simple models, practically negligible effects could be flagged as significant. There are also other general modeling issues that are ignored. The goal here, however, is conceptual simplicity.
set.seed(1234)
N = 200
group = rep(c(0, 1), each=N/2)
initial = .75*group + rnorm(N, sd=.25)
final = .4*initial + .5*group + rnorm(N, sd=.1)
change = final-initial
df = data.frame(id=factor(1:N), group=factor(group, labels=c('Female', 'Male')), initial, final, change)
head(df)
id group initial final change
1 1 Female -0.30176644 -0.07218389 0.2295825445
2 2 Female 0.06935731 0.09741980 0.0280624915
3 3 Female 0.27111029 0.12699551 -0.1441147849
4 4 Female -0.58642443 -0.16449642 0.4219280069
5 5 Female 0.10728117 0.07408057 -0.0332006005
6 6 Female 0.12651397 0.12665183 0.0001378524
id group change time score
1 1 Female 0.22958254 initial -0.30176644
2 1 Female 0.22958254 final -0.07218389
3 2 Female 0.02806249 initial 0.06935731
4 2 Female 0.02806249 final 0.09741980
5 3 Female -0.14411478 initial 0.27111029
6 3 Female -0.14411478 final 0.12699551
In the following we’ll use lavaan to estimate the full mediation model, then run separate regressions to demonstrate the t-test on change vs. the ANCOVA approach. For the mediation model, we only need to estimate the relevant effects on initial and final weight. As noted above, the t-test on change score measures the total effect of sex, while the ANCOVA measures the direct effect on final weight. It is unnecessary to distinguish them as separate modeling approaches, as they are merely standard regressions with different target variables.
mod = "
initial ~ a*group
final ~ b*group + c*initial
# change ~ -1*initial + 1*final (implied)
# total effect
TE := (a*-1) + (a*c*1) + (b*1) # using tracing rules
"
library(lavaan)
lpmod = sem(mod, data=df)
summary(lpmod)
lavaan (0.5-20) converged normally after 28 iterations
Number of observations 200
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Minimum Function Value 0.0000000000000
Parameter Estimates:
Information Expected
Standard Errors Standard
Regressions:
Estimate Std.Err Z-value P(>|z|)
initial ~
group (a) 0.800 0.036 22.317 0.000
final ~
group (b) 0.447 0.026 17.028 0.000
initial (c) 0.445 0.028 16.043 0.000
Variances:
Estimate Std.Err Z-value P(>|z|)
initial 0.064 0.006 10.000 0.000
final 0.010 0.001 10.000 0.000
Defined Parameters:
Estimate Std.Err Z-value P(>|z|)
TE 0.004 0.024 0.165 0.869
summary(lm(change ~ group, df)) # t-test on change scores = total effect
Call:
lm(formula = change ~ group, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.5174 -0.1133 0.0208 0.1310 0.4612
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038975 0.017300 2.253 0.0254 *
groupMale 0.004028 0.024466 0.165 0.8694
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.173 on 198 degrees of freedom
Multiple R-squared: 0.0001369, Adjusted R-squared: -0.004913
F-statistic: 0.02711 on 1 and 198 DF, p-value: 0.8694
summary(lm(final ~ group + initial, df)) # 'ancova' uncovers direct effect etc.
Call:
lm(formula = final ~ group + initial, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.315819 -0.066341 0.005835 0.060977 0.268154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01724 0.01008 1.71 0.0888 .
groupMale 0.44744 0.02648 16.90 <2e-16 ***
initial 0.44538 0.02797 15.92 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1002 on 197 degrees of freedom
Multiple R-squared: 0.9463, Adjusted R-squared: 0.9457
F-statistic: 1734 on 2 and 197 DF, p-value: < 2.2e-16
We can model the change score while adjusting for initial weight (and generally we should). Note that the coefficient for initial weight, -0.55, is equal to the ANCOVA coefficient (0.45) minus 1. One way to think about this is just as we have been, but focusing on initial weight instead of sex. The indirect effect of initial weight on change through final weight is its coefficient (path c) times +1, and the total effect adds the fixed direct path of -1, giving c - 1.
The change score result duplicates the ANCOVA result for the group effect. In fact, the coefficients for all other covariates would be identical in a model for final weight and a model for weight gain, as long as the baseline value is controlled for. They are the direct effects on final weight multiplied by the fixed +1 path from final weight to gain.
See for example, Laird 1983.
summary(lm(change ~ group + initial, df)) # change score adjusted for initial weight
Call:
lm(formula = change ~ group + initial, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.315819 -0.066341 0.005835 0.060977 0.268154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01724 0.01008 1.71 0.0888 .
groupMale 0.44744 0.02648 16.90 <2e-16 ***
initial -0.55462 0.02797 -19.83 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1002 on 197 degrees of freedom
Multiple R-squared: 0.6662, Adjusted R-squared: 0.6628
F-statistic: 196.6 on 2 and 197 DF, p-value: < 2.2e-16
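The equivalence between the two adjusted models can be checked directly. The following is a minimal sketch that re-creates the simulated data from above so it runs on its own; the object names `fit_final`, `fit_change`, and `coefs` are introduced here for illustration only.

```r
# Re-create the simulated data used earlier in the post
set.seed(1234)
N = 200
group = rep(c(0, 1), each = N/2)
initial = .75*group + rnorm(N, sd = .25)
final = .4*initial + .5*group + rnorm(N, sd = .1)
change = final - initial
df = data.frame(group = factor(group, labels = c('Female', 'Male')),
                initial, final, change)

fit_final  = lm(final  ~ group + initial, df)  # 'ANCOVA' on final weight
fit_change = lm(change ~ group + initial, df)  # change score, adjusted for baseline

# Coefficients side by side
coefs = cbind(final = coef(fit_final), change = coef(fit_change))
coefs

# Column difference: zero for the intercept and group, one for initial weight
coefs[, 'final'] - coefs[, 'change']
```

Because change is final minus initial, subtracting initial from the outcome only shifts the coefficient of initial by one and leaves everything else, including the residuals, untouched.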
Wainer & Brown 2007 took a different interpretation of the paradox. Here we can think of a similar situation, but instead of sex differences we now have a group difference regarding whether one dines in a particular room[1].
Visually we can depict it as before but showing the difference. The choice of comic sans font in the graph is due to Wainer and Brown and should be held against them.
The DAG makes clear the difference in the model compared to the previous scenario.
While Wainer and Brown again suggest that both statisticians are correct, Pearl disagrees. Statistician 1 is incorrect because they do not adjust for the confounder (initial weight), which is necessary to determine causal effects.
Note that both paradox scenarios presented assume no latent confounders. If present then both statisticians are potentially wrong in both cases. As depicted however, it was not the case that two legitimate methods gave two different answers to the same research question, as Lord concluded originally.
The problem discussed thus far extends beyond controlling for baseline scores to any covariate, including settings where a focus on change scores isn’t even possible[2].
Here we are concerned with the relationship of birth weight and infant mortality rate. In general, low birth weight is associated with higher likelihood of death. The paradox arises from the fact that low birth weight children born to smoking mothers have a lower mortality rate.
The DAG for this situation is depicted as follows. Smoking does have an effect on birth weight and infant mortality, but so do a host of other variables, at least some of which are far more detrimental.
Pearl explains the result from two perspectives.
What is the causal effect of birth weight on death?
What is the causal effect of smoking on death?
Another perspective is from the point of view of Lord’s paradox. Here we are concerned with the effect of smoking on mortality above and beyond its effect through birth weight (i.e. the mediation context from before). Unlike before (or at least what was assumed before), here we have other confounders.
In this case, adjusting for birth weight doesn’t sever all paths through the mediator. It actually opens up a new path through the other causes of low birth weight, and the resulting effect is spurious.
Essentially we end up in the same situation. Conditioning on birth weight == ‘low’ does not physically keep birth weight from changing. Comparing smoking vs. non-smoking mothers among low birth weight infants amounts to comparing infants with no other causes of low birth weight vs. those with other causes.
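A small simulation can make the selection effect concrete. This is only a sketch: the variable `defect` is a hypothetical stand-in for the other, more harmful causes, and every probability below is an assumption chosen to make the pattern visible, not an estimate from any data.

```r
set.seed(1234)
N = 1e5

# Hypothetical prevalences
smoke  = rbinom(N, 1, .30)   # mother smokes
defect = rbinom(N, 1, .05)   # stand-in for other, more harmful causes

# Both smoking and the other cause raise the chance of low birth weight
lbw   = rbinom(N, 1, .05 + .25*smoke + .60*defect)
# Smoking is mildly harmful, the other cause far more so
death = rbinom(N, 1, .01 + .02*smoke + .02*lbw + .30*defect)

# Mortality rate by smoking status, all infants
mort_all = tapply(death, smoke, mean)

# Mortality rate by smoking status, low birth weight infants only
mort_lbw = tapply(death[lbw == 1], smoke[lbw == 1], mean)

# Share of low birth weight infants that carry the other cause
defect_lbw = tapply(defect[lbw == 1], smoke[lbw == 1], mean)

# Columns: 0 = non-smoking, 1 = smoking
round(rbind(mort_all, mort_lbw, defect_lbw), 3)
```

In this setup smoking raises mortality overall, yet among low birth weight infants the smokers’ group shows the lower rate, because a low birth weight infant of a non-smoking mother is more likely to have the other cause.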
Simpson’s paradox refers to a general phenomenon of reversal of results from what is expected. Lord’s paradox can be seen as a special case, and while we have gone through the details of that particular aspect, we can describe Simpson’s paradox with a simple example.
Consider a treatment given to males and females with the following success rates:
Sex | Control | Treatment |
---|---|---|
Male | 23/27 | 8/9 |
Female | 5/8 | 19/26 |
Sex | Control | Treatment |
---|---|---|
Male | 0.85 | 0.89 |
Female | 0.62 | 0.73 |
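And what are the total results across males and females?
Sex | Control | Treatment |
---|---|---|
All | 28/35 | 27/35 |
All (proportion) | 0.80 | 0.77 |
The treatment has the higher success rate within each sex, yet the lower rate overall. So we are back to our low birth weight issue.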
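The arithmetic behind the reversal can be reproduced from the counts in the tables. The following is a minimal sketch; the object names are introduced here for illustration.

```r
# Successes and totals taken from the tables above
success = matrix(c(23,  8,
                    5, 19), nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c('Male', 'Female'),
                                 Arm = c('Control', 'Treatment')))
total   = matrix(c(27,  9,
                    8, 26), nrow = 2, byrow = TRUE,
                 dimnames = dimnames(success))

rate_by_sex = success / total                    # within-sex success rates
rate_all    = colSums(success) / colSums(total)  # aggregated success rates

round(rate_by_sex, 2)
round(rate_all, 2)

# Treatment vs. control within each sex, then in aggregate
rate_by_sex[, 'Treatment'] > rate_by_sex[, 'Control']
rate_all['Treatment'] > rate_all['Control']
```

The two arms are badly unbalanced: 27 of the 35 controls are male, but only 9 of the 35 treated are, and males have the higher success rate in either arm. The aggregate comparison therefore mixes the treatment effect with the difference in sex composition.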
Pearl notes three things are required for resolving such a paradox.
The surprise is as we have just noted. We see one result within each subgroup, but aggregating over the subgroups leads to a different conclusion. The ‘paradox’ isn’t really a paradox, as the result is just arithmetic, but the surprise it invokes leads us to think of it as one. Our intuition tells us that, for example, a drug can’t be harmful to both men and women but good for the population as a whole. Causally that intuition is in fact correct, but such a reversal can still appear statistically if we aren’t applying an appropriate model.
Pearl’s sure thing theorem:
An action A that increases the probability of event B in each subpopulation must also increase the probability of B in the whole population, provided that the action does not change the distribution of the subpopulations.
In other words, regardless of whether some effect Z is a confounder or not, and even if we don’t have the correct causal structure, such a reversal should invoke suspicion rather than surprise. In the above example, simply having appropriate amounts of data would likely be enough to rule out a reversal.
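One way to write the theorem down uses Pearl’s do-notation, with \(Z\) indexing the subpopulations. The notation here is a sketch of the statement above, not a quotation from the report.

```latex
\begin{aligned}
&\text{If } P(B \mid do(A), Z = z) > P(B \mid do(\neg A), Z = z) \text{ for every } z, \\
&\text{and } P(Z = z \mid do(A)) = P(Z = z \mid do(\neg A)) = P(Z = z), \\
&\text{then } P(B \mid do(A)) = \sum_z P(B \mid do(A), z)\, P(z)
  > \sum_z P(B \mid do(\neg A), z)\, P(z) = P(B \mid do(\neg A)).
\end{aligned}
```

Both sides average the subpopulation results with the same weights \(P(z)\), so the inequality carries over. A reversal in observed proportions can only arise when the weights differ between the groups being compared, as with the unequal sex composition of the two arms in the table above.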
In the following graphical models we have some treatment X and some recovery Y, with an additional covariate Z. In addition, some of the graphs include latent variable(s) L[3].
All of the set 1 graphs are situations that might invite reversal, and in fact are observationally equivalent.
In the following graphs we could have reversal in a-c, but not d-f.
Pearl suggests using the back-door criterion in order to help us make a decision, summarized as follows:
This leads to the following conclusions:
In set 1 we need to condition on Z in a and d (blocking the back-door path \(X \leftarrow Z \rightarrow Y\)). We would not in b and c: in b there are no back-door paths, and in c the back-door path is already blocked when Z is not conditioned on.
When conditioning on Z is required, the Z-specific information carries the correct information. In other cases, however, e.g. Set 1 graph c and Set 2 graphs a and c, the aggregated information is correct because the spurious path \(X \rightarrow Z \leftarrow Y\) is blocked if Z is not conditioned on.
In some cases there is not enough information with Z to block potential back-door paths, as in Set 2 b.
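When Z does satisfy the back-door criterion, the standard adjustment formula \(P(y \mid do(x)) = \sum_z P(y \mid x, z)\,P(z)\) combines the Z-specific results into a population-level answer. The sketch below applies it to the treatment table from the Simpson’s paradox example, under the assumption (made here for illustration, not stated for that table) that sex is a pre-treatment covariate affecting both treatment assignment and recovery.

```r
# Counts from the Simpson's paradox tables above
success = matrix(c(23, 8, 5, 19), nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c('Male', 'Female'),
                                 Arm = c('Control', 'Treatment')))
total   = matrix(c(27, 9, 8, 26), nrow = 2, byrow = TRUE,
                 dimnames = dimnames(success))

p_y_given_xz = success / total              # P(Y = 1 | X = x, Z = z)
p_z          = rowSums(total) / sum(total)  # P(Z = z), pooled over both arms

# Back-door adjustment: weight each sex-specific rate by P(z), then sum over z
# (p_z recycles down the rows, so row i is multiplied by p_z[i])
adjusted = colSums(p_y_given_xz * p_z)
round(adjusted, 3)
```

With both arms weighted by the same sex distribution, the adjusted rates favor the treatment, in agreement with the sex-specific tables rather than the aggregated one.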
Unfortunately most modeling situations are much more complex than the simple scenarios depicted. Most of the time an experiment is not an option and the nature of the relationships among variables is ambiguous, leaving any causal explanation an impossible prospect. However, even in those situations, thinking causally can help our general understanding, and perhaps make some of these ‘surprising’ situations less so.
Laird, N. 1983. Further Comparative Analyses of Pretest-Posttest Research Designs.
Pearl, J. 2014. Lord’s Paradox Revisited – (Oh Lord! Kumbaya!).
Pearl, J. 2013. Understanding Simpson’s Paradox.
Senn, S. 2006. Change from baseline and analysis of covariance revisited.
[1] The reasoning behind not using sex was because it is not a manipulable variable. See Holland & Rubin (1986), “No causation without manipulation.”
[2] Lord himself acknowledged this in determining group differences on college freshman grade point average while adjusting for ‘aptitude’.
[3] Oddly Pearl doesn’t actually mention what L represents anywhere in the article.