Empirical Bayes

Revisiting an old post

Michael Clark https://m-clark.github.io
2019-06-21

Table of Contents


Introduction

A couple of folks I work with in different capacities independently came across an article by Data Camp’s David Robinson1 demonstrating empirical bayes. It provides a nice and simple example of how to create a prior from the observed data, allowing it to induce shrinkage in estimates, in that case, career batting averages of Major League Baseball players. This would better allow one to compare someone that had only a relatively few at-bats to those that had longer careers.

It is a simple and straightforward demo, and admits that it doesn’t account for many other things that could be brought into the model, but that’s also why it’s effective at demonstrating the technique. However, shrinkage of parameter estimates can be accomplished in other ways, so I thought I’d compare it to two of my preferred ways to do so - a fully Bayesian approach and a random effects/mixed-model approach.

I demonstrate shrinkage in mixed models in more detail here and here, and I’m not going to explain Bayesian analysis in general, but feel free to see my doc on it. This post is just to provide a quick comparison of techniques.

Data Setup

We’ll start as we typically do, with the data. The following just duplicates David’s code from the article. Nothing new here. If you want the details, please read it.


library(dplyr)
library(tidyr)
library(Lahman)

career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%  # This removes Babe Ruth!
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# use names along with the player IDs
career <- Master %>%
  tbl_df() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID") 

career_filtered <- career %>%
  filter(AB >= 500)

With data in place, we can get the empirical bayes estimates. Again, this is just the original code. As a reminder, we assume a beta distribution for batting average, and the mean of the filtered data is 0.259. This finds the corresponding \(\alpha\) and \(\beta\) values for the beta distribution using MASS.


m <- MASS::fitdistr(career_filtered$average, 
                    dbeta,
                    start = list(shape1 = 1, shape2 = 10))

alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]

career_eb <- career %>%
  mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0))


We use the estimated parameters as input for the beta prior. Let’s examine what we’ve got.


Just to refresh, we can see how the EB estimates are able to guess something more meaningful for someone with just a few at-bats than say, a 0 batting average. Even for Ody Abbot there, we would guess something closer to the overall average than their .186 average after 70 plate appearances. With Frank Abercrombie, who had no hits in a measly 4 at bats, with so little information, we’d give him the benefit of the doubt of being average.

Models

As mentioned, I will compare the empirical bayes results to those of a couple of other approaches. They are:

The advantages to these are that using a fully Bayesian approach allows us to not approximate the Bayesian and just do it. In the other case, the standard mixed model provides shrinkage with a penalized regression approach which also approximates the Bayesian, but doesn’t require any double dipping of the data to get at a prior, or any additional steps aside from running the model.

In both cases, we can accomplish the desired result with just a standard R modeling approach. In particular, the model is a standard binomial model for counts. With base R glm, we would do something like the following:


glm(cbind(H, AB-H) ~ ..., data = career_eb, family = binomial)

The model is actually for the count of successes out of the total, which R has always oddly done in glm as cbind(# successes, # failures) rather than the more intuitive route (my opinion). The brms package will make it more obvious, but glmer uses the glm approach. The key difference for both models relative to the standard binomial is that we add a per-observation random effect for playerID2.

Bayesian Model

We’ll start with the full Bayesian approach using brms. This model will struggle a bit3, and takes a while to run, as it’s estimating 9672 parameters. But in the end we get what we want.


# in case anyone wants to use rstanarm I show it here
# library(rstanarm)
# bayes_full = stan_glmer(cbind(H, AB-H) ~ 1 + (1|playerID),
#                         data = career_eb,
#                         family = binomial)

library(brms)
bayes_full = brm(H|trials(AB) ~ 1 + (1|playerID), 
                 data = career_eb,
                 family = binomial,
                 seed = 1234,
                 iter = 1000,
                 thin = 4,
                 cores = 4)

With the posterior predictive check we can see right off the bat4 that this approach estimates the data well. Our posterior predictive distribution for the number of hits is hardly distinguishable from the observed data.

Again, the binomial model is for counts (out of some total), in this case, the number of hits. But if we wanted proportions, which in this case are the batting averages, we could just divide this result by the AB (at bats) column. Here we can see a little more nuance, especially that the model shies away from the lower values more, but this would still be a good fit by any standards.

Mixed Model

The lme4 model takes the glm approach as far as syntax goes cbind(successes, non-successes). Very straightforward, and fast, as it doesn’t actually estimate the random effects, but instead predicts them. The predicted random effects are in fact akin to empirical bayes estimates5.


glmer_full = lme4::glmer(cbind(H, AB-H) ~ 1 + (1|playerID), 
                         data = career_eb,
                         family = binomial)

Filtered Data Models

Since David’s original ‘prior’ was based only on observations for those who had at least 500+ at bats (essentially a full season), the following re-runs the previous models just for the filtered data set, to see how those comparisons turn out.


bayes_filtered = brm(H|trials(AB) ~ 1 + (1|playerID), 
                     data = career_eb %>% filter(AB >= 500),
                     family = binomial,
                     iter = 1000,
                     seed = 1234,
                     thin = 4,
                     cores = 4)

glmer_filtered = lme4::glmer(cbind(H, AB-H) ~ 1 + (1|playerID), 
                             data = career_eb %>% filter(AB >= 500),
                             family = binomial)

Prediction Comparisons

Now we’re ready to make some comparisons. We’ll combine the fits from the models to the original data set.


career_other = career_eb %>% 
  mutate(
    bayes_estimate = fitted(bayes_full)[,1] / AB,
    glmer_estimate = fitted(glmer_full),
  )

career_other_filtered = career_filtered %>% 
  mutate(
    bayes_filtered_estimate = fitted(bayes_filtered)[,1] / AB,
    glmer_filtered_estimate = fitted(glmer_filtered),
  ) %>% 
  select(playerID, contains('filter'))

career_all = left_join(career_other, 
                       career_other_filtered)