Revisiting an old post
A couple of folks I work with in different capacities independently came across an article by Data Camp’s David Robinson1 demonstrating empirical bayes. It provides a nice and simple example of how to create a prior from the observed data, allowing it to induce shrinkage in estimates, in that case, career batting averages of Major League Baseball players. This would better allow one to compare someone that had only a relatively few at-bats to those that had longer careers.
It is a simple and straightforward demo, and admits that it doesn’t account for many other things that could be brought into the model, but that’s also why it’s effective at demonstrating the technique. However, shrinkage of parameter estimates can be accomplished in other ways, so I thought I’d compare it to two of my preferred ways to do so - a fully Bayesian approach and a random effects/mixed-model approach.
I demonstrate shrinkage in mixed models in more detail here and here, and I’m not going to explain Bayesian analysis in general, but feel free to see my doc on it. This post is just to provide a quick comparison of techniques.
We’ll start as we typically do, with the data. The following just duplicates David’s code from the article. Nothing new here. If you want the details, please read it.
library(dplyr)
library(tidyr)
library(Lahman)
career <- Batting %>%
filter(AB > 0) %>%
anti_join(Pitching, by = "playerID") %>% # This removes Babe Ruth!
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
mutate(average = H / AB)
# use names along with the player IDs
career <- People %>%
as_tibble() %>%
select(playerID, nameFirst, nameLast) %>%
unite(name, nameFirst, nameLast, sep = " ") %>%
inner_join(career, by = "playerID")
career_filtered <- career %>%
filter(AB >= 500)
With data in place, we can get the empirical bayes estimates. Again, this is just the original code. As a reminder, we assume a beta distribution for batting average, and the mean of the filtered data is 0.258. This finds the corresponding \(\alpha\) and \(\beta\) values for the beta distribution using MASS.
We use the estimated parameters as input for the beta prior. Let’s examine what we’ve got.
Just to refresh, we can see how the EB estimates are able to guess something more meaningful for someone with just a few at-bats than say, a 0 batting average. Even for Ody Abbot there, we would guess something closer to the overall average than their .186 average after 70 plate appearances. With Frank Abercrombie, who had no hits in a measly 4 at bats, with so little information, we’d give him the benefit of the doubt of being average.
As mentioned, I will compare the empirical bayes results to those of a couple of other approaches. They are:
The advantages to these are that using a fully Bayesian approach allows us to not approximate the Bayesian and just do it. In the other case, the standard mixed model provides shrinkage with a penalized regression approach which also approximates the Bayesian, but doesn’t require any double dipping of the data to get at a prior, or any additional steps aside from running the model.
In both cases, we can accomplish the desired result with just a standard R modeling approach. In particular, the model is a standard binomial model for counts. With base R glm, we would do something like the following:
The model is actually for the count of successes out of the total,
which R has always oddly done in glm as
cbind(# successes, # failures)
rather than the more
intuitive route (my opinion). The brms package
will make it more obvious, but glmer uses the
glm approach. The key difference for both
models relative to the standard binomial is that we add a
per-observation random effect for
playerID
2.
We’ll start with the full Bayesian approach using brms. This model will struggle a bit3, and takes a while to run, as it’s estimating 9864 parameters. But in the end we get what we want.
# in case anyone wants to use rstanarm I show it here
# library(rstanarm)
# bayes_full = stan_glmer(cbind(H, AB-H) ~ 1 + (1|playerID),
# data = career_eb,
# family = binomial)
library(brms)
bayes_full = brm(H|trials(AB) ~ 1 + (1|playerID),
data = career_eb,
family = binomial,
seed = 1234,
iter = 1000,
thin = 4,
cores = 4)
With the posterior predictive check we can see right off the bat4 that this approach estimates the data well. Our posterior predictive distribution for the number of hits is hardly distinguishable from the observed data.
Again, the binomial model is for counts (out of some total), in this case, the number of hits. But if we wanted proportions, which in this case are the batting averages, we could just divide this result by the AB (at bats) column. Here we can see a little more nuance, especially that the model shies away from the lower values more, but this would still be a good fit by any standards.
The lme4 model takes the glm approach as
far as syntax goes cbind(successes, non-successes)
. Very
straightforward, and fast, as it doesn’t actually estimate the random
effects, but instead predicts them. The predicted random
effects are in fact akin to empirical bayes estimates5.
Since David’s original ‘prior’ was based only on observations for those who had at least 500+ at bats (essentially a full season), the following re-runs the previous models just for the filtered data set, to see how those comparisons turn out.
Now we’re ready to make some comparisons. We’ll combine the fits from the models to the original data set.
career_other = career_eb %>%
mutate(
bayes_estimate = fitted(bayes_full)[,1] / AB,
glmer_estimate = fitted(glmer_full),
)
career_other_filtered = career_filtered %>%
mutate(
bayes_filtered_estimate = fitted(bayes_filtered)[,1] / AB,
glmer_filtered_estimate = fitted(glmer_filtered),
) %>%
select(playerID, contains('filter'))
career_all = left_join(career_other,
career_other_filtered)