Maximum Likelihood Review

This is a very brief refresher on maximum likelihood estimation using a standard regression approach as an example. Given the likelihood’s role in Bayesian estimation and statistics in general, and the ties between specific Bayesian results and maximum likelihood estimates one typically comes across, one should be conceptually comfortable with some basic likelihood estimation.

In the standard model setting we attempt to find parameters \(\theta\) that will maximize the probability of the data we actually observe. We’ll start with an observed random target vector \(y\) with \(i...N\) independent and identically distributed observations and some data-generating process underlying it \(f(\cdot|\theta)\). The key idea is that we are interested in estimating the model parameter(s), \(\theta\), that would make the data most likely to have occurred. The probability density function for an observed value of \(y\) given some particular estimate for the parameters can be noted as \(f(y_i|\theta)\). The joint probability distribution of the (independent) observations given those parameters is the product of the individual densities, and this is our basic likelihood function. We can write it out generally as:

\[\mathcal{L}(\theta) = \prod_{i=1}^N f(y_i|\theta)\]

So the likelihood for one set of parameter estimates given a fixed set of data y, is equal to the probability of the data given those (fixed) estimates. Furthermore, we can compare one set, \(\mathcal{L}(\theta_A)\), to that of another, \(\mathcal{L}(\theta_B)\), and whichever produces the greater likelihood would be the preferred set of estimates. We can get a sense of this with the following visualization, based on a single parameter, Poisson distributed variable. The data is drawn from a variable with mean \(\theta=5\). We note the calculated likelihood increases as we estimate values for \(\theta\) closer to \(5\). With more and more data, the final ML estimate will converge on the true value.

Final estimate = 5.02

For computational reasons, we instead work with the sum of the natural log probabilities56, and thus the log likelihood:

\[\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[f(y_i|\theta)]\]

Concretely, we calculate a log likelihood for each observation and then sum them for the total likelihood for parameter(s) \(\theta\).

The likelihood function incorporates our assumption about the sampling distribution of the data given some estimate for the parameters. It can take on many forms and be notably complex depending on the model in question, but once specified, we can use any number of optimization approaches to find the estimates of the parameter that make the data most likely. As an example, for a normally distributed variable of interest we can write the log likelihood as follows:

\[\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-\mu)^2}{2\sigma^2})]\]


In the following we will demonstrate the maximum likelihood approach to estimation for a simple setting incorporating a normal distribution, where we estimate the mean and variance/sd for a set of values \(y\)57. First the data is created, and then we create the function that will compute the log likelihood. Using the built in R distributions58 makes it fairly straightforward to create our own likelihood function and feed it into an optimization function to find the best parameters. We will set things up to work with the bbmle package, which has some nice summary functionality and other features. However, one should take a glance at optim and the other underlying functions that do the work.

# for replication

# create the data
y = rnorm(1000, mean = 5, sd = 2)
startvals = c(0, 1)

# the log likelihood function
LL = function(mu = startvals[1], sigma = startvals[2], verbose = TRUE) {
  ll = -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
  if (verbose) message(paste(mu, sigma, ll))

The LL function takes starting points for the parameters as arguments, in this case we call them \(\mu\) and \(\sigma\), which will be set to 0 and 1 respectively. Only the first line (ll = -sum…) is actually necessary, and we use dnorm to get the density for each point59. Since this optimizer is by default minimization, we reverse the sign of the sum so as to minimize the negative log likelihood, which is the same as maximizing the likelihood. Note that the bit of other code just allows you to see the estimates as the optimization procedure searches for the best values. I do not show that here, but you’ll see it in your console if verbose=TRUE.

We are now ready to obtain maximum likelihood estimates for the parameters. For the mle2 function we will need the function we’ve created, plus other inputs related to that function or the underlying optimizing function used (by default optim). In this case we will use an optimization procedure that will allow us to set a lower bound for \(\sigma\). This isn’t strictly necessary, but otherwise you would get warnings and possibly lack of convergence if negative estimates for \(\sigma\) were allowed60.

# using optim, and L-BFGS-B so as to constrain sigma to be positive by setting
# the lower bound at zero
mlnorm =  mle2(
  start  = list(mu = 2, sigma = 1),
  method = "L-BFGS-B",
  lower  = c(sigma = 0)


mle2(minuslogl = LL, start = list(mu = 2, sigma = 1), method = "L-BFGS-B", 
    lower = c(sigma = 0))

      mu    sigma 
4.946803 1.993680 

Log-likelihood: -2108.92 
# compare to an intercept only regression model
broom::tidy(lm(y ~ 1))
# A tibble: 1 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)     4.95    0.0631      78.4       0

We can see that the ML estimates are the same as the intercept only model estimates61, which given the sample size are close to the true values.

In terms of the parameters we estimate, in the typical case of two or more parameters we can think of a likelihood surface that represents the possible likelihood values given any particular set of estimates. Given some starting point, the optimization procedure then travels along the surface looking for a minimum/maximum point62. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. The optimizer travels along this surface until it finds a minimum (the surface plot is interactive- feel free to adjust). I also plot the path of the optimizer from a top down view. The large blue dot noted represents the minimum negative log likelihood.

A bit of jitter was added to the points to better see what’s going on.

Please note that there are many other considerations in optimization completely ignored here, but for our purposes and the audience for which this is intended, we do not want to lose sight of the forest for the trees. We now move next to a slightly more complicated regression example.

Linear Model

In the standard regression context, the expected value for the target variable comes from our linear predictor, i.e. the weighted combination of our explanatory variables, and we estimate the regression weights/coefficients and possibly other relevant parameters. We can expand our previous example to the standard linear model without too much change. In this case we estimate a mean for each observation, but otherwise assume the variance is constant across observations. Again, we first construct some data so that we know exactly what to expect, then write out the likelihood function with starting parameters. As we need to estimate our intercept and coefficient for the X predictor (collectively referred to as \(\beta\)), we can think of our likelihood explicitly as before:

\[\ln\mathcal{L}(\beta, \sigma^2) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-X\beta)^2}{2\sigma^2})]\]

# for replication

# predictor
X = rnorm(1000)

# coefficients for intercept and predictor
beta = c(5, 2)

# add intercept to X and create y with some noise
y = cbind(1, X) %*% beta + rnorm(1000, sd = 2.5)

regLL = function(sigma = 1, Int = 0, b1 = 0){
  coefs = c(Int, b1)
  mu = cbind(1, X) %*% coefs
  ll = -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
  message(paste(sigma, Int, b1, ll))


mlopt =  mle2(regLL, method = "L-BFGS-B", lower = c(sigma = 0)) 

Maximum likelihood estimation

mle2(minuslogl = regLL, method = "L-BFGS-B", lower = c(sigma = 0))

      Estimate Std. Error z value     Pr(z)    
sigma 2.447823   0.054735  44.721 < 2.2e-16 ***
Int   5.039976   0.077435  65.087 < 2.2e-16 ***
b1    2.139284   0.077652  27.549 < 2.2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

-2 log L: 4628.273 
# plot(profile(mlopt), absVal = FALSE)

modlm = lm(y ~ X)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     5.04    0.0775      65.0 0        
2 X               2.14    0.0777      27.5 1.58e-124
-2 * logLik(modlm)
'log Lik.' 4628.273 (df=3)

As before, our estimates and final log likelihood value are about where they should be, and reflect the lm output, as the OLS estimates are the maximum likelihood estimates. The visualization becomes more difficult beyond two parameters, but we can examine slices similar to the previous plot.

To move to generalized linear models, very little changes of the process. We will have a potentially different distribution assumed, and we are typically modeling a function of the target variable (e.g. \(\log(y)=X\beta; \mu = e^{X\beta}\)).

Binomial Likelihood Example

This regards the example seen in the early part of the document with the hands-on example.

x1 = rbinom(1000, size = 10, p = .5)
x2 = rbinom(1000, size = 10, p = .85)

binomLL = function(theta, x) {
  -sum(dbinom(x, size = 10, p = theta, log = TRUE))

         x = x1,
         lower = 0,
         upper = 1)
[1] 0.5043001

[1] 1902.557
[1] 5.043
         x = x2,
         lower = 0,
         upper = 1)
[1] 0.8568963

[1] 1438.786
[1] 8.569

Modeling Languages

I will talk only briefly about a couple of the modeling language options available, as you will have to make your own choice among many.


When I first wrote this document, BUGS (Lunn et al. 2012) (Bayesian inference Using Gibbs Sampling) was perhaps the most widely known and used Bayesian modeling language, and it has been around for over 30 years at this point. It is implemented via OpenBUGS and freely available for download63. It even has a GUI interface if such a thing is desired.


JAGS (Just Another Gibbs Sampler) is a more recent dialect of the BUGS language, and is also free to use. It offers some technical and modeling advantages to OpenBUGs, but much of the code translates directly from one to the other.


Nimble allows one to implement BUGS style models within R, but also is highly extensible and provides additional options.


Stan is a relative newcomer to Bayesian modeling languages, and as I write this it is about to have its 10 year anniversary. It uses a different estimation procedure than the BUGS language, and this makes it more flexible and perhaps better behaved for many types of models. It actually compiles Stan code to C++, and so can be very fast as well. In addition, it has interfaces for Python, Julia and more. I personally prefer it as I find it more clear in its expression, and there are many associated tools for model exploration.


R has many modeling packages devoted to Bayesian analysis such that there is a Task View specific to the topic. Most of them are even specific to the implementation of a certain type of analysis64. So not only can you do everything within R and take advantage of the power of those languages, you can then use Bayesian specific R packages on the results. For standard and even some complex models I would suggest using the rstanarm or brms packages as a way to stick to the usual R modeling style, unless you have a notably complicated model, at which point you can use rstan.

General Statistical Packages

The general statistical languages such as SAS, SPSS, and Stata were generally very late to the Bayesian game, even for implementations of Bayesian versions of commonly used models. SAS and Stata are fairly capable from what I can tell (but there is StataStan). Others still seem to be lacking as well. In general, I wouldn’t recommend these packages for Bayesian analysis, except as an interface to one of the Bayesian specific languages, assuming they have the capability and you are already familiar with the program.

Other Programming Languages

Python has functionality via modules such as PyMC, and Stan has a Python implementation, PyStan. Julia and Matlab both have Stan ports as well. And with any programming language that you might use for statistical analysis, you could certainly do a lot of it by hand if you have the time, though you should exhaust tested implementations first.


In short, you have plenty of options. I would suggest starting with a Bayesian programming language or using that language within your chosen statistical environment or package. This gives you the most modeling flexibility, choice, and opportunity to learn.

BUGS Example

The following provides a BUGS example65 of the primary model used in this document. The applicable code for the data set up is in the Example: Linear Regression Model section of the document. The model matrix X must be a matrix class object. Next we setup a BUGS data list as we did with Stan, and create a text file that contains the model code. Note that the data list comprises simple characters which are used to look for objects of those names that are in the environment. Also, I use cat with sink so that I don’t have to go to a separate text editor to create the file.

One of the big differences between BUGS and other languages is its use of the precision parameter \(\frac{1}{\sigma^2}\), the inverse variance, usually denoted as \(\tau\). While there were some computational niceties to be had in doing so, even the authors admit this was not a good decision in retrospect. Prepare to have that issue come up from time to time when you inevitably forget. Comments and assignments are the same as R, and distributions noted with \(\sim\).

bugsdat = list('y', 'X', 'N', 'K')

# This will create a file, lmbugs.txt that will subsequently be called 
'model {
  for (n in 1:N){
    mu[n] <- beta[1]*X[n,1] + beta[2]*X[n,2] + beta[3]*X[n,3] + beta[4]*X[n,4]
    y[n] ~ dnorm(mu[n], inv.sigma.sq)
  for (k in 1:K){
    beta[k] ~ dnorm(0, .001) # prior for reg coefs
  # Half-cauchy as in Gelman 2006
  # Scale parameter is 5, so precision of z = 1/5^2 = 0.04
  sigma.y <- abs(z)/sqrt(chSq) # prior for sigma; cauchy = normal/sqrt(chi^2)
  z ~ dnorm(0, .04)I(0,)
  chSq ~ dgamma(0.5, 0.5) # chi^2 with 1 d.f.
  inv.sigma.sq <- pow(sigma.y, -2) # precision
  # sigma.y ~ dgamma(.001, .001) # prior for sigma; a typical approach used.

# explicitly provided initial values not necessary, but one can specify them  
# as follows, and you may have problems with variance parameters if you don't. 
# Note also that sigma.y is unnecesary if using the half-cauchy approach as  
# it is defined based on other values.

# inits <- list(list(beta=rep(0,4), sigma.y=runif(1,0,10)),
#               list(beta=rep(0,4), sigma.y=runif(1,0,10)),
#               list(beta=rep(0,4), sigma.y=runif(1,0,10)))
# parameters <- c('beta', 'sigma.y')

Now we are ready to run the model. You’ll want to examine the help file for the bugs function for more information. In addition, depending on your setup you may need to set the working directory and other options. Note that n.thin argument is used differently than other packages. One specifies the n posterior draws (per chain) you to keep want as n.iter-n.burnin. The thinned samples aren’t stored. Compare this to other packages where n.iter is the total before thinning and including burn-in, and n.keep is (n.iter-n.burnin)/n.thin. With the function used here, n.keep is the same, but as far as arguments your you’ll want to think of n.iter as the number of posterior draws after thinning. So the following all produce 1000 posterior draws in R2OpenBUGS:

n.iter = 3000 n.thin = 1 n.burnin = 2000
n.iter = 3000 n.thin = 10 n.burnin = 2000
n.iter = 3000 n.thin = 100 n.burnin = 2000

In other packages, with those arguments you’d end up with 1000, 100, and 10 posterior draws.

lmbugs <- bugs(
  inits      = NULL,
  parameters = c('beta', 'sigma.y'),
  model.file = 'lmbugs.txt',
  n.chains   = 3,
  n.iter     = 3000,
  n.thin     = 10,
  n.burnin   = 2000

Now we are ready for the results, which will be the same as what we saw with Stan. In addition to the usual output, you get the deviance information criterion as a potential means for model comparison.

## lmbugs$summary
mean sd 2.5% 50% 97.5% Rhat n.eff
beta[1] 4.90 0.13 4.65 4.90 5.14 1 2400
beta[2] 0.08 0.13 -0.17 0.08 0.34 1 3000
beta[3] -1.47 0.13 -1.72 -1.47 -1.22 1 2100
beta[4] 0.82 0.12 0.59 0.83 1.05 1 3000
sigma.y 2.03 0.09 1.86 2.02 2.22 1 3000
deviance 1063.61 3.15 1059.00 1063.00 1071.00 1 3000

The usual model diagnostics are available with conversion of the results to a general MCMC object, here shown via the the posterior package. You can then use bayesplot or other visualization tools. Figures are not shown, but they are the typical traceplots and density plots.

bugs_samples = posterior::as_draws(lmbugs$sims.array)



JAGS Example

The following shows how to run the regression model presented earlier in the document via JAGS. Once you have the data set up as before, the data list is done in the same fashion as with BUGS. The code itself is mostly identical, save for the use of T instead of I for truncation. JAGS, being a BUGS dialect, also uses the precision parameter in lieu of the variance, which isn’t annoying at all.

jagsdat = list(
  'y' = y,
  'X' = X,
  'N' = N,
  'K' = K

'model {
  for (n in 1:N){
    mu[n] <- beta[1] * X[n, 1] + beta[2] * X[n, 2] + beta[3] * X[n, 3] + beta[4] * X[n, 4]
    y[n] ~ dnorm(mu[n], inv.sigma.sq)
  for (k in 1:K){
    beta[k] ~ dnorm(0, .001)                               
  # Half-cauchy as in Gelman 2006
  # Scale parameter is 5, so precision of z = 1/5^2 = 0.04
  sigma.y <- z/sqrt(chSq)       
  z ~ dnorm(0, .04)T(0,)
  chSq ~ dgamma(0.5, 0.5)                                  
  inv.sigma.sq <- pow(sigma.y, -2)                         

# explicitly provided initial values not necessary, but can specify as follows
# inits <- function() {
#   list(beta = rep(0, 4), sigma.y = runif(1, 0, 10))
# }

parameters = c('beta', 'sigma.y')

With everything set, we can now run the model. With JAGS, we have what might be called an initialization stage that sets the model up and runs through the warm-up period, after which we can then flexibly sample from the posterior via the coda.samples function.

# library(coda)

lmjagsmod = jags.model(
  file     = 'data/lmjags.txt',
  data     = jagsdat,
  n.chains = 3,
  n.adapt  = 2000
  # inits = inits  # if you use them
lmjags = coda.samples(
  model    = lmjagsmod,
  n.iter   = 10000,
  thin     = 10,
  n.chains = 3,
  variable.names = parameters

Now we have a model identical to the others, and can summarize the posterior distribution in similar fashion. As before, you can use the posterior and other packages as well.

term estimate std.error conf.low conf.high ESS
beta[1] 4.89 0.13 4.64 5.15 3000.00
beta[2] 0.08 0.13 -0.17 0.34 3453.26
beta[3] -1.47 0.13 -1.71 -1.22 3000.00
beta[4] 0.82 0.12 0.57 1.06 3000.00
sigma.y 2.02 0.09 1.86 2.22 3000.00

Metropolis Hastings Example

Next depicted is a random walk Metropolis-Hastings algorithm using the data and model from prior sections of the document. I had several texts open while cobbling together this code such as Gelman et al. (2013), and some oriented towards the social sciences by Gill (2008), Jackman (2009), and Lynch (2007) etc. Some parts of the code reflect information and code examples found therein, and follows Lynch’s code a bit more66.

The primary functions that we need to specify regard the posterior distribution67, an update step for beta coefficients, and an update step for the variance estimate.

# posterior function

post = function(x, y, b, s2) {
  # Args: X is the model matrix; y the target vector; b and s2 the parameters
  # to be estimated

  beta   = b           
  sigma  = sqrt(s2)
  sigma2 = s2
  mu     = X %*% beta
  # priors are b0 ~ N(0, sd = 10), sigma2 ~ invGamma(.001, .001)
  priorbvarinv = diag(1/100, 4) 
  prioralpha = priorbeta = .001
  if (is.nan(sigma) | sigma <= 0) {     # scale parameter must be positive
  # Note that you will not find the exact same presentation across texts and 
  # other media for the log posterior in this conjugate setting.  In the end
  # they are conceptually still (log) prior + (log) likelihood (See commented 'else')
  else {                            
    -.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y-mu))) +
      -.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta)%*%priorbvarinv%*%beta)) + 
      -(prioralpha+1)*log(sigma2) + log(sigma2) - priorbeta/sigma2 
  # else {
  #   ll = mvtnorm::dmvnorm(
  #     y,
  #     mean = mu,
  #     sigma = diag(sigma2, length(y)),
  #     log = TRUE
  #   )
  #   priorb = mvtnorm::dmvnorm(
  #     beta,
  #     mean = rep(0, length(beta)),
  #     sigma = diag(100, length(beta)),
  #     log = TRUE
  #   )
  #   priors2 = dgamma(1 / sigma2, prioralpha, priorbeta, log = TRUE)
  #   logposterior = ll + priorb + priors2
  #   logposterior
  # }

# update step for regression coefficients

updatereg = function(i, x, y, b, s2) {
  # Args are the same as above but with additional i iterator argument.
  b[i,] = MASS::mvrnorm(1, mu = b[i-1, ], Sigma = bvarscale)  # proposal/jumping distribution
  # Compare to past- does it increase the posterior probability?
  postdiff = 
    post(x = x, y = y, b = b[i, ],   s2 = s2[i-1]) - 
    post(x = x, y = y, b = b[i-1, ], s2 = s2[i-1]) 
  # Acceptance phase
  unidraw = runif(1)
  accept  = unidraw < min(exp(postdiff), 1)  # accept if so
  if (accept) b[i,]
  else b[i-1,]

# update step for sigma2

updates2 = function(i, x, y, b, s2) {
  s2candidate =  rnorm(1, s2[i-1], sd = sigmascale)
  if (s2candidate < 0) {
    accept = FALSE
  else {
    s2diff = 
      post(x = x, y = y, b = b[i, ], s2 = s2candidate) - 
      post(x = x, y = y, b = b[i, ], s2 = s2[i-1])
    unidraw = runif(1)
    accept = unidraw < min(exp(s2diff), 1)
  ifelse(accept, s2candidate, s2[i-1])

Now we can set things up for the MCMC chain68. Aside from the typical MCMC setup and initializing the parameter matrices to hold the draws from the posterior, we also require scale parameters to use for the jumping/proposal distribution.

# Setup, starting values etc. 
nsim   = 5000
burnin = 1000
thin   = 10

b  = matrix(0, nsim, ncol(X))        # initialize beta update matrix
s2 = rep(1, nsim)                    # initialize sigma vector

# For the following this c term comes from BDA3 12.2 and will produce an
# acceptance rate of .44 in 1 dimension and declining from there to about 
# .23 in high dimensions. For the sigmascale, the magic number comes from 
# starting with a value of one and fiddling from there to get around .44.

c = 2.4/sqrt(ncol(b))
bvar = vcov(lm(y ~ ., data.frame(X[, -1]))) 
bvarscale  = bvar * c^2               
sigmascale = .9

We can now run and summarize the model with tools from the coda package.

# Run
for (i in 2:nsim) {
  b[i, ] = updatereg(
    i  = i,
    y  = y,
    x  = X,
    b  = b,
    s2 = s2
  s2[i] = updates2(
    i  = i,
    y  = y,
    x  = X,
    b  = b,
    s2 = s2

# calculate acceptance rates
b_acc_rate  = mean(diff(b[(burnin + 1):nsim,]) != 0)
s2_acc_rate = mean(diff(s2[(burnin + 1):nsim]) != 0)            

tibble(b_acc_rate, s2_acc_rate)

# get final chain

b_final  = as_draws(b[seq(burnin + 1, nsim, by = thin), ])
s2_final = as_draws(matrix(s2[seq(burnin + 1, nsim, by = thin)]))

# get summaries; compare to lm and stan

round(c(coef(modlm), summary(modlm)$sigma ^ 2), 3)
variable mean sd mcse q2.5 q97.5 rhat ess_bulk ess_tail
beta.1 4.90 0.13 0.01 4.64 5.14 1.00 276.19 331.70
beta.2 0.09 0.14 0.01 -0.16 0.36 1.00 228.91 207.34
beta.3 -1.46 0.13 0.01 -1.70 -1.21 1.00 311.59 273.93
beta.4 0.82 0.13 0.01 0.59 1.05 1.00 286.56 283.76
s2 4.09 0.37 0.02 3.42 4.86 1.03 396.93 390.88

Here is the previous Stan fit for comparison.

term estimate std.error conf.low conf.high
beta[1] 4.90 0.13 4.63 5.14
beta[2] 0.09 0.13 -0.18 0.34
beta[3] -1.47 0.13 -1.72 -1.22
beta[4] 0.82 0.12 0.58 1.06
sigma 2.03 0.09 1.86 2.22

Hamiltonian Monte Carlo Example

The following demonstrates Hamiltonian Monte Carlo, the technique that Stan uses, and which is a different estimation approach than the Gibbs sampler in BUGS/JAGS. If you are interested in the details enough to be reading this, I highly recommend Betancourt’s conceptual introduction to HMC. This example still assumes the data we used in this document, and is largely based on the code in the appendix of Gelman et al. (2013).

First we start with the functions.

# Log posterior function
log_p_th = function(X, y, theta) {
  # Args: X is the model matrix; y the target vector; theta are the current 
  # parameter estimates.
  beta   = theta[-length(theta)]           # reg coefs to be estimated
  sigma  = theta[length(theta)]            # sigma to be estimated
  sigma2 = sigma^2
  mu     = X %*% beta
  # priors are b0 ~ N(0, sd=10), sigma2 ~ invGamma(.001, .001)
  priorbvarinv = diag(1/100, 4) 
  prioralpha = priorbeta = .001
  if (is.nan(sigma) | sigma<=0) {     # scale parameter must be positive, so posterior
    return(-Inf)                      # density is zero if it jumps below zero
  # log posterior in this conjugate setting. conceptually it's 
  # (log) prior + (log) likelihood. (See commented 'else')
  else {
    -.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y-mu))) +
      -.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta)%*%priorbvarinv%*%beta)) +
      -(prioralpha+1)*log(sigma2) + log(sigma2) - priorbeta/sigma2
  # else {
  #   ll = mvtnorm::dmvnorm(
  #     y,
  #     mean  = mu,
  #     sigma = diag(sigma2, length(y)),
  #     log   = TRUE
  #   )
  #   priorb = mvtnorm::dmvnorm(
  #     beta,
  #     mean  = rep(0, length(beta)),
  #     sigma = diag(100, length(beta)),
  #     log   = TRUE
  #   )
  #   priors2 = dgamma(1/sigma2, prioralpha, priorbeta, log = TRUE)
  #   logposterior = ll + priorb + priors2
  #   logposterior
  # }

# numerical gradient function as given in BDA3 p. 602; same args as posterior
gradient_theta_numerical = function(X, y, theta) {
  d = length(theta)
  e = .0001
  diffs = numeric(5)
  for (k in 1:d) {
    th_hi = theta
    th_lo = theta
    th_hi[k] = theta[k] + e
    th_lo[k] = theta[k] - e
    diffs[k] = (log_p_th(X, y, th_hi) - log_p_th(X, y, th_lo)) / (2*e)

# single HMC iteration function
hmc_iteration = function(
  ) {
  # Args: epsilon is the stepsize; L is the number of leapfrog steps; epsilon
  # and L are drawn randomly at each iteration to explore other areas of the
  # posterior (starting with epsilon0 and L0); M is a diagonal mass matrix 
  # (expressed as a vector), a bit of a magic number in this setting. It regards
  # the mass of a particle whose position is represented by theta, and momentum 
  # by phi. See the sampling section of chapter 1 in the Stan manual for more
  # detail.

  M_inv     = 1/M
  d         = length(theta)
  phi       = rnorm(d, 0, sqrt(M))
  th_old    = theta
  log_p_old = log_p_th(X, y, theta) - .5*sum(M_inv*phi^2)
  phi       = phi + .5 * epsilon * gradient_theta_numerical(X, y, theta)
  for (l in 1:L) {
    theta = theta + epsilon*M_inv*phi
    phi = phi + ifelse(l == L, .5, 1) * 
      epsilon * gradient_theta_numerical(X, y, theta)
  # here we get into standard MCMC stuff, jump or not based on a draw from a
  # proposal distribution
  phi = -phi
  log_p_star = log_p_th(X, y, theta) - .5 * sum(M_inv * phi^2)    
  r = exp(log_p_star - log_p_old)
  if (is.nan(r)) r = 0
  p_jump = min(r, 1)
  if (runif(1) < p_jump) {
    th_new = theta
  else {
    th_new = th_old
  return(list(theta = th_new, p_jump = p_jump))  # returns estimates and acceptance rate

# main HMC function
hmc_run = function(
  starts,    # starting values for theta
  iter,      # total number of simulations for each chain; chain is based on the dimension of starts
  warmup,    # determines initial iterations that will be ignored for inference purposes
  epsilon_0, # the baseline stepsize
  L_0,       # the baseline number of leapfrog steps
  M,         # the mass vector
  ) {
  # Args: 

  chains = nrow(starts)
  d      = ncol(starts)
  sims   = array(NA, c(iter, chains, d), dimnames = list(NULL, NULL, colnames(starts)))
  p_jump = matrix(NA, iter, chains)
  for (j in 1:chains) {
    theta = starts[j, ]
    for (t in 1:iter) {
      epsilon = runif(1, 0, 2 * epsilon_0)
      L = ceiling(2 * L_0 * runif(1))
      temp = hmc_iteration(X, y, theta, epsilon, L, M)
      p_jump[t, j] = temp$p_jump
      sims[t, j,] = temp$theta
      theta = temp$theta
  # rstan::monitor(sims, warmup, digits_summary=3)
  acc = round(colMeans(p_jump[(warmup + 1):iter,]), 3)  # acceptance rate
    'Avg acceptance probability for each chain: ',
    paste0(acc[1], ', ', acc[2]),          '\n'
  return(list(sims = sims, p_jump = p_jump))

With the primary functions in place, we set the starting values and choose other settings for for the HMC process. The coefficient starting values are based on random draws from a uniform distribution, while \(\sigma\) is set to a value of one in each case. For this example we’ll have 2000 total draws with warm-up set to 1000. I don’t have any thinning option here, but that could be added or simply done as part of the coda package preparation.

# Starting values and mcmc settings
parnames = c(paste0('beta[', 1:4, ']'), 'sigma')
d = length(parnames)
chains = 2

thetastart = t(replicate(chains, c(runif(d-1, -1, 1), 1)))
colnames(thetastart) = parnames

nsim   = 2000
warmup = 1000

# fiddle with these to get a desirable acceptance rate of around .80. The
# following work well with the document data.
stepsize = .08
nLeap    = 10

We are now ready to run the model. On my machine and with the above settings, this took a couple seconds. Once complete, we can use the posterior package as we have done before.

# Run the model
fit_hmc = hmc_run(
  starts    = thetastart,
  iter      = nsim,
  warmup    = warmup,
  epsilon_0 = stepsize,
  L_0       = nLeap,
  M         = mass_vector,
  X         = X,
  y         = y

# str(fit_hmc, 1)

# use posterior to summarize the two chains

theta = as_draws_array(fit_hmc$sims[(warmup + 1):nsim, , ])

finalest =  summary(theta)$mean
b   = finalest[1:4]
sig = finalest[5]

# log posterior
log_p_th(X, y, finalest)
Avg acceptance probability for each chain: 0.823, 0.814
variable mean sd mcse q2.5 q97.5 rhat ess_bulk ess_tail
beta[1] 4.90 0.12 0 4.66 5.13 1.01 1261.80 892.17
beta[2] 0.09 0.13 0 -0.16 0.33 1.00 1204.27 792.15
beta[3] -1.47 0.13 0 -1.71 -1.23 1.00 1308.67 1039.06
beta[4] 0.82 0.13 0 0.55 1.07 1.01 1247.88 826.81
sigma 2.01 0.09 0 1.85 2.18 1.00 1389.68 1051.83
log posterior = -301.716

Our estimates look pretty good, and inspection of the diagnostics would show good mixing and convergence as well. At this point we can compare it to the Stan output. For the following, I modified the previous Stan code to use the same inverse gamma prior and tweaked the control options for a little bit more similarity.

term estimate std.error conf.low conf.high
beta[1] 4.90 0.13 4.64 5.14
beta[2] 0.08 0.13 -0.17 0.34
beta[3] -1.47 0.13 -1.72 -1.22
beta[4] 0.82 0.12 0.58 1.06
sigma 2.02 0.09 1.86 2.22
log posterior = -301.532


Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed.
Gill, Jeff. 2008. Bayesian Methods : A Social and Behavioral Sciences Approach. Second. Boca Raton: Chapman & Hall/CRC.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Chichester, UK: Wiley.
Lunn, David, Chris Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. Boca Raton, FL: Chapman; Hall/CRC.
Lynch, Scott M. 2007. Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. New York: Springer.

  1. Math refresher on logs: log(A*B) = log(A)+log(B). So summing the log probabilities will result in the same values for \(\theta\), but won’t result in extremely small values that will break our computer.↩︎

  2. Of course, we could just use the sample estimates, but this is for demonstration.↩︎

  3. Type ?Distributions at the console for some of the basic R distributions available.↩︎

  4. Much more straightforward than writing the likelihood function as above.↩︎

  5. An alternative approach would be to work with the log of \(\sigma\) which can take on negative values, and then convert it back to the original scale.↩︎

  6. Actually, there is a difference between the sigma estimates in that OLS estimates are based on a variance estimate divided by \(N-p\) while the MLE estimate has a divisor of \(N\).↩︎

  7. Which is equivalent to finding the point where the slope of the tangent line for some function to the surface is zero. The derivative, or gradient in the case of multiple parameters, of the likelihood function with respect to the parameters is known as the score function.↩︎

  8. You might come across a previous incarnation, WinBugs, but it is no longer being developed.↩︎

  9. Many of these packages, if not all of them, will be less flexible in model specification compared to implementing the model in the aforementioned languages directly, or using the R interface to those languages. What’s more, R has interfaces to the previous language engines via the packages nimble, rjags, and rstan, rstanarm, brms, and many others.↩︎

  10. A word of warning, I haven’t even had R2OpenBUGS on my machine in years, so this code for post-processing the model may be a bit iffy depending on how the package works these days. The model code, which is the important part, is fine though. The JAGS section has been recently tested however.↩︎

  11. Some of these texts may have more recent editions, so consult those if interested.↩︎

  12. Assuming normal for \(\beta\) coefficients, inverse gamma on \(\sigma^2\).↩︎

  13. This code regards only one chain, though a simple loop or any number of other approaches would easily extend it to two or more.↩︎