Revisiting an old post

A couple of folks I work with in different capacities independently came across an article by DataCamp's David Robinson^{1} demonstrating empirical Bayes. It provides a nice and simple example of how to create a prior from the observed data, allowing it to induce shrinkage in the estimates, in that case the career batting averages of Major League Baseball players. This better allows one to compare someone who had relatively few at-bats to those who had longer careers.

It is a simple and straightforward demo, and the article admits that it doesn't account for many other things that could be brought into the model, but that's also why it's effective at demonstrating the technique. However, shrinkage of parameter estimates can be accomplished in other ways, so I thought I'd compare it to two of my preferred approaches: a fully Bayesian model and a random effects/mixed-model approach.

I demonstrate shrinkage in mixed models in more detail here and here, and I'm not going to explain Bayesian analysis in general, but feel free to see my doc on it. This post is just a quick comparison of techniques.

We'll start as we typically do, with the data. The following just duplicates David's code from the article. Nothing new here; if you want the details, please read it.

```
library(dplyr)
library(tidyr)
library(Lahman)

career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%  # this removes Babe Ruth!
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# use names along with the player IDs
career <- People %>%
  as_tibble() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID")

career_filtered <- career %>%
  filter(AB >= 500)
```

With the data in place, we can get the empirical Bayes estimates. Again, this is just the original code. As a reminder, we assume a beta distribution for batting average, and the mean of the filtered data is 0.258. This finds the corresponding \(\alpha\) and \(\beta\) values for the beta distribution using MASS.
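That estimation step looks roughly like the following. This is a sketch of the original article's approach; the object names and the `start` values are illustrative choices, not required ones.

```r
library(MASS)

# fit a beta distribution to the filtered batting averages by
# maximum likelihood; the start values are just rough guesses
beta_fit <- fitdistr(
  career_filtered$average,
  dbeta,
  start = list(shape1 = 1, shape2 = 10)
)

alpha0 <- beta_fit$estimate[1]
beta0  <- beta_fit$estimate[2]
```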

We use the estimated parameters as input for the beta prior. Let's examine what we've got.
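The empirical Bayes estimate for each player is then just the posterior mean of the beta-binomial: hits and at-bats updated by the prior parameters. A sketch, assuming `alpha0` and `beta0` from the previous fit:

```r
# shrink each player's average toward the prior mean; players with
# few at-bats are pulled in the most
career_eb <- career %>%
  mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0))
```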

Just to refresh, we can see how the EB estimates are able to guess something more meaningful for someone with just a few at-bats than, say, a 0 batting average. Even for Ody Abbot there, we would guess something closer to the overall average than his .186 average after 70 at-bats. With Frank Abercrombie, who had no hits in a measly 4 at-bats, with so little information we'd give him the benefit of the doubt of being average.

As mentioned, I will compare the empirical bayes results to those of a couple of other approaches. They are:

- Bayesian mixed model on full data (using brms)
- Standard mixed model on full data (using lme4)
- Bayesian mixed model on filtered data (at bats greater than 500)
- Standard mixed model on filtered data

The advantages of these are as follows. The fully Bayesian approach doesn't approximate a Bayesian analysis, it simply is one. The standard mixed model induces shrinkage via a penalized regression approach that also approximates the Bayesian result, but it requires no double dipping into the data to construct a prior, and no additional steps beyond running the model.

In both cases, we can accomplish the desired result with just a standard R modeling approach. In particular, the model is a standard binomial model for counts. With base R glm, we would do something like the following:

The model is actually for the count of successes out of the total, which R's glm has always oddly specified as `cbind(# successes, # failures)` rather than a more intuitive route (my opinion). The brms package will make it more obvious, but glmer uses the glm approach. The key difference for both models relative to the standard binomial model is that we add a per-observation random effect for `playerID`^{2}.
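A minimal sketch of that base R call, without the random effect yet (object names here are hypothetical):

```r
# intercept-only binomial model for hits out of at-bats
glm_fit <- glm(
  cbind(H, AB - H) ~ 1,
  data   = career_eb,
  family = binomial
)

# the intercept is on the logit scale; convert back to a proportion
plogis(coef(glm_fit))
```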

We'll start with the fully Bayesian approach using brms. This model will struggle a bit^{3}, and takes a while to run, as it's estimating 9864 parameters. But in the end we get what we want.

```
# in case anyone wants to use rstanarm, I show it here
# library(rstanarm)
# bayes_full = stan_glmer(
#   cbind(H, AB - H) ~ 1 + (1 | playerID),
#   data   = career_eb,
#   family = binomial
# )

library(brms)

bayes_full = brm(
  H | trials(AB) ~ 1 + (1 | playerID),
  data   = career_eb,
  family = binomial,
  seed   = 1234,
  iter   = 1000,
  thin   = 4,
  cores  = 4
)
```

With the posterior predictive check we can see right off the bat^{4} that this approach estimates the
data well. Our posterior predictive distribution for the number of hits
is hardly distinguishable from the observed data.

Again, the binomial model is for counts (out of some total), in this case the number of hits. But if we wanted proportions, which here are the batting averages, we could just divide the result by the AB (at-bats) column. Here we can see a little more nuance, especially that the model shies away from the lower values more, but this would still be a good fit by any standard.
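The check itself is a one-liner with brms, and converting the predictions to averages is just the division described above (a sketch; `pp_avg` is a hypothetical name):

```r
# density overlay of observed hit counts vs. posterior predictive draws
pp_check(bayes_full)

# for batting averages rather than counts, divide each observation's
# simulated hits by that player's at-bats
pp_avg <- sweep(posterior_predict(bayes_full), 2, career_eb$AB, `/`)
```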

The lme4 model takes the glm approach as far as syntax goes, i.e. `cbind(successes, non-successes)`. Very straightforward, and fast, as it doesn't actually estimate the random effects, but instead *predicts* them. The predicted random effects are in fact akin to empirical Bayes estimates^{5}.
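That model might look like the following (a sketch; the object name matches what's used in the comparison code later):

```r
library(lme4)

# same binomial count model, with a per-player random intercept
glmer_full = glmer(
  cbind(H, AB - H) ~ 1 + (1 | playerID),
  data   = career_eb,
  family = binomial
)
```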

Since David's original 'prior' was based only on observations for those who had at least 500 at-bats (essentially a full season), the following re-runs the previous models on just the filtered data set, to see how those comparisons turn out.
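Those refits might look like the following, a sketch with the same specifications as before, only the data changing (object names match the comparison code below):

```r
# same models as above, fit to players with 500+ at-bats only
bayes_filtered = brm(
  H | trials(AB) ~ 1 + (1 | playerID),
  data   = career_filtered,
  family = binomial,
  seed   = 1234,
  iter   = 1000,
  thin   = 4,
  cores  = 4
)

glmer_filtered = glmer(
  cbind(H, AB - H) ~ 1 + (1 | playerID),
  data   = career_filtered,
  family = binomial
)
```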

Now we're ready to make some comparisons. We'll combine the fits from the models with the original data set.

```
career_other = career_eb %>%
  mutate(
    bayes_estimate = fitted(bayes_full)[, 1] / AB,
    glmer_estimate = fitted(glmer_full)
  )

career_other_filtered = career_filtered %>%
  mutate(
    bayes_filtered_estimate = fitted(bayes_filtered)[, 1] / AB,
    glmer_filtered_estimate = fitted(glmer_filtered)
  ) %>%
  select(playerID, contains('filter'))

career_all = left_join(
  career_other,
  career_other_filtered,
  by = 'playerID'
)
```