Generalized Estimating Equations

A generalized estimating equation is an estimation procedure¹³ for dealing with clustered data, and is seemingly very popular in disciplines trained with a biostatistics perspective, but perhaps not too commonly used elsewhere. Models using this approach are sometimes called marginal models, and can be seen as follows, where the target \(y\) is multivariate normal with mean vector \(X\beta\) and covariance matrix \(\Sigma\), which is typically block diagonal as we created at the beginning¹⁴.

\[ y \sim \mathcal{N}(X\beta, \Sigma) \]

Here the focus is on the ‘population average’ effects, akin to ‘fixed’ effects in the RE model, and in some circumstances they are identical. In addition, one specifies the type of covariance structure thought to underlie the data. Among the more common are:

independence: no correlation
autoregressive: as described previously
exchangeable: same correlation everywhere (aka ‘compound symmetry’ or ‘spherical’)
unstructured: all possible correlations are estimated as parameters

And there are many others. The GEE approach is identical to RE intercept-only model approach if one conducts a linear Gaussian model, as in this case. In addition, these correlation structures are often available to mixed models tools, so this extension alone should not be a reason¹⁵ to use the GEE approach.

We’ll use the geepack package in order to conduct the gee approach to the model. We’ll specify an autoregressive correlation structure as we did with the mixed model, and as the data was in fact designed with.

library(geepack)

gee_mod = geeglm(y ~ time + treatment, data=d, corstr='ar1', id=id, waves=time)

	Estimate	Std.err	Wald
(Intercept)	0.127	0.030	17.7
time	0.507	0.009	3117.9
treatmenttreatment	-0.413	0.041	100.6

Residual Variance
1.48

	Estimate	Std.err
alpha	0.757	0.007

We should be getting used to the coefficient estimates for the population-average, a.k.a. fixed effects, by now. We also obtain an estimate for the residual correlation, here noted as alpha.

Note the similarities here compared with the mixed model where there is only a random intercept. A couple of changes are made to keep things as similar as possible.

mixed_mod_ri_only = lme(y ~ time + treatment, data=d, random = ~1|id, cor=corAR1(form=~time), 
                        control=lmeControl(opt='optim'), method='ML')  # changed opt bc nlminb had issues

Fixed effects: y ~ time + treatment
	Value	Std.Error	DF	t-value
(Intercept)	0.126	0.032	7499	3.92
time	0.507	0.008	7499	60.59
treatmenttreatment	-0.412	0.042	2498	-9.82

Standardized Within-Group Residuals
Min	Q1	Med	Q3	Max
-3.85	-0.66	0.03	0.66	3.45

Linear mixed-effects model fit by maximum likelihood : y ~ time + treatment
	Observations	Groups	Log-restricted-likelihood
id	10000	2500	-12707

Variance	StdDev
0.003	0.051
1.504	1.227

Phi
0.78

The population average effects are identical (though the geeglm function automatically does cluster robust standard errors). The estimated correlations for both are similar, and a bit high. There is essentially no cluster variance in the mixed model, and both estimated residual variances are similar, and similar to the standard linear model we started with. This makes sense as the variance is equal to the residual variance + the intercept variance + slope variance.

summary(lm_mod)$sigma^2                   # similar to gee and random intercept only

[1] 1.479611

sum(as.numeric(VarCorr(mixed_mod)[,1]))   # model that incorporates all sources of variance

[1] 1.306739

intsd^2 + timesd^2 + sigmasq              # 'truth'

[1] 1.3125

GEE models are generally robust to misspecification of the covariance structure. They are rarely implemented for more than simple clustering but some tools allow for it¹⁶.

Where GEE differ from random effects models in interpretation regards non-Gaussian models. Consider a gender effect for a binary outcome where individuals are nested within families. In this case we could use a standard logistic regression, but would want to take the clustering into account. The gender effect for GEE is comparing hypothetical males in many different families to hypothetical females in many different families (controlling for other covariates). In the mixed model with a random effect for family, you are comparing males in one family to females in the same family (again controlling for other factors). The latter odds ratio will typically be larger because there is less fluctuation due to family-to-family differences¹⁷. However, this isn’t a larger effect, just a different one.

Pros

Easy modeling of different correlation structures
Focus on population level effects
Extends the cluster robust approach to GLM setting and other correlation structures¹⁸
Robust to misspecification of correlation structure
May be more feasible with larger data situations than mixed models

Cons

Cluster-specific effects are not estimated
Not easily extendable to other clustering situations, e.g. nested
Missing data is assumed Missing Completely at Random (MCAR) (RE and FE assume MAR)

Gist: If your goal is to focus on population average effects and ignore subject specific effects, without the drawbacks of the FE model, the GEE approach might be considered.

Not a model! We’re just doing a GLM here.↩
The corresponding matrix formulation for the conditional model of the random effects approach is: \[ y \sim \mathcal{N}(X\beta + Z\Gamma, \Sigma) \] where \(Z\) is some subset of \(X\), and \(\Gamma\) contains the random effects.↩
lme4 is a very widely used mixed model package that does not allow the specification of a correlation structure for the residuals.↩
CSCAR director and University of Michigan statistics faculty Kerby Shedden has worked on a Python implementation for nested structures as part of the statsmodels module.↩
This is a paraphrase of a response by Kerby Shedden to a question on the CSCAR help list that I thought was succinct and easy to understand. He continues: “I don’t have a strong opinion about this. It is almost always good to control for observed variables to reduce heterogeneity and get a sharper analysis. But it’s less clear to me that you should control for things that are unattributable (e.g. family effects of unknown cause), even if it is technically possible to do so. It comes down to this: do you want to measure my risk (for being male) by comparing me to my sister (or an imaginary sister who is identical to me in all observables except gender)…or do you want to compare me to an unrelated female, also identical to me in terms of all observables?”↩
Recall the note at the end of the cluster robust SE chapter.↩