Cluster Robust Variances
As mentioned, if we ignore the clustering, the so-called fixed effects estimates (i.e. coefficients) are correct, but their variances are not. We can get proper estimates of the standard errors via cluster robust standard errors, which are very popular in econometrics and fields trained in that fashion, but not widely used elsewhere in my experience. Essentially, these allow one to fire-and-forget, and treat the clustering as more of a nuisance.
In programs like Stata, obtaining these are basically an option for most modeling procedures. In R, it’s not quite as straightforward, but not difficult. There are packages such as sandwich that can provide heteroscedastic robust standard errors, but won’t necessarily take into account clustering. I provide a custom function that will work in this example so that the curtain can be pulled back a little, but the plm package would be the way to go for cluster robust standard errors. The plm package can take into account the serial autocorrelation via the ‘arellano’ input to the type
argument.
library(plm)
# create numeric time so plm won't treat as a factor, time is a reserved name in plm
t2 = d$time
clusterrob_mod <- plm(y ~ t2 + treatment, data=d, model='pooling', index='id')
summary(clusterrob_mod)
Pooling Model
Call:
plm(formula = y ~ t2 + treatment, data = d, model = "pooling",
index = "id")
Balanced Panel: n = 2500, T = 4, N = 10000
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-4.724323 -0.810506 0.030167 0.815708 4.231312
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.130929 0.023712 5.5217 3.442e-08 ***
t2 0.508559 0.010880 46.7436 < 2.2e-16 ***
treatmenttreatment -0.421493 0.024328 -17.3255 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 18469
Residual Sum of Squares: 14792
R-Squared: 0.1991
Adj. R-Squared: 0.19894
F-statistic: 1242.57 on 2 and 9997 DF, p-value: < 2.22e-16
The following compares the different robust SEs we could produce at this point.
resids = resid(lm_mod) # residuals
xvar = solve(crossprod(lm_mod$x)) # (X'X)^-1
gd0 = data.frame(id=d$id, r=resids, lm_mod$x)
# custom function to be applied within cluster
cluscom = function(mat){
X = as.matrix(mat)[, 3:5] # magic numbers represent int, time and treatment columns
crossprod(X, tcrossprod(mat$r)) %*% X
}
# calculate within cluster variance
gd = gd0 %>%
group_by(id) %>%
do(xvar = cluscom(.))
# plm
plm_se = vcovHC(clusterrob_mod, method='arellano', cluster='group') %>% diag %>% sqrt %>% round(3)
# custom output
custom_se = (xvar %*% Reduce(`+`, gd$xvar) %*% xvar) %>% diag %>% sqrt %>% round(3)
# original lm
lm_se = vcov(lm_mod) %>% diag %>% sqrt %>% round(3)
# non-clustered 'robust' se
lm_robse = vcovHC(lm_mod, type='HC0') %>% diag %>% sqrt %>% round(3)
(Intercept) t2 treatmenttreatment
0.030 0.009 0.042
(Intercept) time treatmenttreatment
0.030 0.009 0.042
(Intercept) time treatmenttreatment
0.024 0.011 0.024
(Intercept) time treatmenttreatment
0.023 0.011 0.024
Pros
- Don’t have to think too much about the issues at play
- Fewer assumptions about the model
- A general heteroscedastic ‘robust’ approach
Cons
- Ignores cluster-specific effects
- By default, assumes no intracluster correlation
- Few tools go beyond simple clustering, requiring manual approach
- Not as general as GEE approach
- Suggests that there are problems in model specification that the method itself does not correct
Gist: if your goal is to simply do a standard GLM approach and essentially ignore the clustering, treating it more or less as a nuisance to overcome, and your grouping structure is simple, then this may be for you. However, note that the GEE approach as a possibly more flexible alternative that can still incorporate cluster robust standard errors4. Furthermore, differences between the robust and regular SE may suggest a model misspecification issue. ‘Fixing’ standard errors won’t solve that problem.
The cluster robust standard errors as depicted above assume independent error structure in the GEE model, and thus are a special case of GEE. When you peruse the GEE model, rerun it with the argument
corst='ind'
to duplicate the plm and custom results above.↩