Generalized Estimating Equations

A generalized estimating equation is an estimation procedure13 for dealing with clustered data, and is seemingly very popular in disciplines trained with a biostatistics perspective, but perhaps not too commonly used elsewhere. Models using this approach are sometimes called marginal models, and can be seen as follows, where the target $$y$$ is multivariate normal with mean vector $$X\beta$$ and covariance matrix $$\Sigma$$, which is typically block diagonal as we created at the beginning14.

$y \sim \mathcal{N}(X\beta, \Sigma)$

Here the focus is on the ‘population average’ effects, akin to ‘fixed’ effects in the RE model, and in some circumstances they are identical. In addition, one specifies the type of covariance structure thought to underlie the data. Among the more common are:

• independence: no correlation
• autoregressive: as described previously
• exchangeable: same correlation everywhere (aka ‘compound symmetry’ or ‘spherical’)
• unstructured: all possible correlations are estimated as parameters

And there are many others. The GEE approach is identical to RE intercept-only model approach if one conducts a linear Gaussian model, as in this case. In addition, these correlation structures are often available to mixed models tools, so this extension alone should not be a reason15 to use the GEE approach.

We’ll use the geepack package in order to conduct the gee approach to the model. We’ll specify an autoregressive correlation structure as we did with the mixed model, and as the data was in fact designed with.

library(geepack)

gee_mod = geeglm(y ~ time + treatment, data=d, corstr='ar1', id=id, waves=time)
Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.127 0.030 17.7 0
time 0.507 0.009 3117.9 0
treatmenttreatment -0.413 0.041 100.6 0
Residual Variance
1.48
Estimate Std.err
alpha 0.757 0.007

We should be getting used to the coefficient estimates for the population-average, a.k.a. fixed effects, by now. We also obtain an estimate for the residual correlation, here noted as alpha.

Note the similarities here compared with the mixed model where there is only a random intercept. A couple of changes are made to keep things as similar as possible.

mixed_mod_ri_only = lme(y ~ time + treatment, data=d, random = ~1|id, cor=corAR1(form=~time),
control=lmeControl(opt='optim'), method='ML')  # changed opt bc nlminb had issues
Fixed effects: y ~ time + treatment
Value Std.Error DF t-value p-value
(Intercept) 0.126 0.032 7499 3.92 0
time 0.507 0.008 7499 60.59 0
treatmenttreatment -0.412 0.042 2498 -9.82 0
Standardized Within-Group Residuals
Min Q1 Med Q3 Max
-3.85 -0.66 0.03 0.66 3.45
Linear mixed-effects model fit by maximum likelihood : y ~ time + treatment
Observations Groups Log-restricted-likelihood
id 10000 2500 -12707
Variance StdDev
0.003 0.051
1.504 1.227
Phi
0.78

The population average effects are identical (though the geeglm function automatically does cluster robust standard errors). The estimated correlations for both are similar, and a bit high. There is essentially no cluster variance in the mixed model, and both estimated residual variances are similar, and similar to the standard linear model we started with. This makes sense as the variance is equal to the residual variance + the intercept variance + slope variance.

summary(lm_mod)\$sigma^2                   # similar to gee and random intercept only
[1] 1.479611
sum(as.numeric(VarCorr(mixed_mod)[,1]))   # model that incorporates all sources of variance
[1] 1.306739
intsd^2 + timesd^2 + sigmasq              # 'truth'
[1] 1.3125

GEE models are generally robust to misspecification of the covariance structure. They are rarely implemented for more than simple clustering but some tools allow for it16.

Where GEE differ from random effects models in interpretation regards non-Gaussian models. Consider a gender effect for a binary outcome where individuals are nested within families. In this case we could use a standard logistic regression, but would want to take the clustering into account. The gender effect for GEE is comparing hypothetical males in many different families to hypothetical females in many different families (controlling for other covariates). In the mixed model with a random effect for family, you are comparing males in one family to females in the same family (again controlling for other factors). The latter odds ratio will typically be larger because there is less fluctuation due to family-to-family differences17. However, this isn’t a larger effect, just a different one.

Pros

• Easy modeling of different correlation structures
• Focus on population level effects
• Extends the cluster robust approach to GLM setting and other correlation structures18
• Robust to misspecification of correlation structure
• May be more feasible with larger data situations than mixed models

Cons

• Cluster-specific effects are not estimated
• Not easily extendable to other clustering situations, e.g. nested
• Missing data is assumed Missing Completely at Random (MCAR) (RE and FE assume MAR)

Gist: If your goal is to focus on population average effects and ignore subject specific effects, without the drawbacks of the FE model, the GEE approach might be considered.

1. Not a model! We’re just doing a GLM here.

2. The corresponding matrix formulation for the conditional model of the random effects approach is: $y \sim \mathcal{N}(X\beta + Z\Gamma, \Sigma)$ where $$Z$$ is some subset of $$X$$, and $$\Gamma$$ contains the random effects.

3. lme4 is a very widely used mixed model package that does not allow the specification of a correlation structure for the residuals.

4. CSCAR director and University of Michigan statistics faculty Kerby Shedden has worked on a Python implementation for nested structures as part of the statsmodels module.

5. This is a paraphrase of a response by Kerby Shedden to a question on the CSCAR help list that I thought was succinct and easy to understand. He continues: “I don’t have a strong opinion about this. It is almost always good to control for observed variables to reduce heterogeneity and get a sharper analysis. But it’s less clear to me that you should control for things that are unattributable (e.g. family effects of unknown cause), even if it is technically possible to do so. It comes down to this: do you want to measure my risk (for being male) by comparing me to my sister (or an imaginary sister who is identical to me in all observables except gender)…or do you want to compare me to an unrelated female, also identical to me in terms of all observables?”

6. Recall the note at the end of the cluster robust SE chapter.