# Appendix

## Maximum Likelihood Review

This is a very brief refresher on maximum likelihood estimation using a standard regression approach as an example, and more or less assumes one hasn’t tried to roll their own such function in a programming environment before. Given the likelihood’s role in Bayesian estimation and statistics in general, and the ties between specific Bayesian results and maximum likelihood estimates one typically comes across, one should be conceptually comfortable with some basic likelihood estimation.

In the standard model setting we attempt to find parameters $$\theta$$ that will maximize the probability of the data we actually observe. We’ll start with an observed random target vector $$y$$ with $$i...N$$ independent and identically distributed observations and some data-generating process underlying it $$f(\cdot|\theta)$$. We are interested in estimating the model parameter(s), $$\theta$$, that would make the data most likely to have occurred. The probability density function for $$y$$ given some particular estimate for the parameters can be noted as $$f(y_i|\theta)$$. The joint probability distribution of the (independent) observations given those parameters, $$f(y_i|\theta)$$, is the product of the individual densities, and is our likelihood function. We can write it out generally as:

$\mathcal{L}(\theta) = \prod_{i=1}^N f(y_i|\theta)$

Thus, the likelihood for one set of parameter estimates given a fixed set of data y, is equal to the probability of the data given those (fixed) estimates. Furthermore, we can compare one set, $$\mathcal{L}(\theta_A)$$, to that of another, $$\mathcal{L}(\theta_B)$$, and whichever produces the greater likelihood would be the preferred set of estimates. We can get a sense of this with the following visualization, based on a single parameter, Poisson distributed variable. The data is drawn from a variable with mean $$\theta=5$$. We note the calculated likelihood increases as we estimate values for $$\theta$$ closer to $$5$$. With more and more data, the final ML estimate will converge on the true value.

Final estimate = 5.02

For computational reasons, we instead work with the sum of the natural log probabilities58, and thus the log likelihood:

$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[f(y_i|\theta)]$

Concretely, we calculate a log likelihood for each observation and then sum them for the total likelihood for parameter(s) $$\theta$$.

The likelihood function incorporates our assumption about the sampling distribution of the data given some estimate for the parameters. It can take on many forms and be notably complex depending on the model in question, but once specified, we can use any number of optimization approaches to find the estimates of the parameter that make the data most likely. As an example, for a normally distributed variable of interest we can write the log likelihood as follows:

$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-\mu)^2}{2\sigma^2})]$

### Example

In the following we will demonstrate the maximum likelihood approach to estimation for a simple setting incorporating a normal distribution, where we estimate the mean and variance/sd for a set of values $$y$$59. First the data is created, and then we create the function that will compute the log likelihood. Using the built in R distributions60 makes it fairly straightforward to create our own likelihood function and feed it into an optimization function to find the best parameters. We will set things up to work with the bbmle package, which has some nice summary functionality and other features. However, one should take a glance at optim and the other underlying functions that do the work.

# for replication
set.seed(1234)

# create the data
y = rnorm(1000, mean=5, sd=2)
startvals = c(0, 1)

# the log likelihood function
LL = function(mu=startvals[1], sigma=startvals[2], verbose=TRUE) {
ll = -sum(dnorm(y, mean=mu, sd=sigma, log=T))
if (verbose) message(paste(mu, sigma, ll))
ll
}

The LL function takes starting points for the parameters as arguments, in this case we call them $$\mu$$ and $$\sigma$$, which will be set to 0 and 1 respectively. Only the first line (ll = -sum…) is actually necessary, and we use dnorm to get the density for each point61. Since this optimizer is by default minimization, we reverse the sign of the sum so as to minimize the negative log likelihood, which is the same as maximizing the likelihood. Note that the bit of other code just allows you to see the estimates as the optimization procedure searches for the best values. I do not show that here but you’ll see it in your console if verbose=TRUE.

We are now ready to obtain maximum likelihood estimates for the parameters. For the mle2 function we will need the function we’ve created, plus other inputs related to that function or the underlying optimizing function used (by default optim). In this case we will use an optimization procedure that will allow us to set a lower bound for $$\sigma$$. This isn’t strictly necessary, but otherwise you would get warnings and possibly lack of convergence if negative estimates for $$\sigma$$ were allowed62.

library(bbmle)
# using optim, and L-BFGS-B so as to constrain sigma to be positive by setting
# the lower bound at zero
mlnorm =  mle2(LL, start=list(mu=2, sigma=1), method="L-BFGS-B", lower=c(sigma=0))
mlnorm

Call:
mle2(minuslogl = LL, start = list(mu = 2, sigma = 1), method = "L-BFGS-B",
lower = c(sigma = 0))

Coefficients:
mu    sigma
4.946803 1.993680

Log-likelihood: -2108.92 
# compare to an intercept only regression model
summary(lm(y~1))

Call:
lm(formula = y ~ 1)

Residuals:
Min      1Q  Median      3Q     Max
-6.7389 -1.2933 -0.0264  1.2848  6.4450

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.94681    0.06308   78.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.995 on 999 degrees of freedom

We can see that the ML estimates are the same as the intercept only model estimates63, which given the sample size are close to the true values.

In terms of the parameters we estimate, in the typical case of two or more parameters we can think of a likelihood surface that represents the possible likelihood values given any particular set of estimates. Given some starting point, the optimization procedure then travels along the surface looking for a minimum/maximum point64. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. The optimizer travels along this surface until it finds a minimum (the surface plot is interactive- feel free to adjust). I also plot the path of the optimizer from a top down view. The large blue dot noted represents the minimum negative log likelihood.

A bit of jitter was added to the points to better see what’s going on.

Please note that there are many other considerations in optimization completely ignored here, but for our purposes and the audience for which this is intended, we do not want to lose sight of the forest for the trees. We now move next to a slightly more complicated regression example.

## Linear Model

In the standard regression context, our expected value for the target variable comes from our linear predictor, i.e. the weighted combination of our explanatory variables, and we estimate the regression weights/coefficients and possibly other relevant parameters. We can expand our previous example to the standard linear model without too much change. In this case we estimate a mean for each observation, but otherwise assume the variance is constant across observations. Again, we first construct some data so that we know exactly what to expect, then write out the likelihood function with starting parameters. As we need to estimate our intercept and coefficient for the X predictor (collectively referred to as $$\beta$$), we can think of our likelihood explicitly as before:

$\ln\mathcal{L}(\beta, \sigma^2) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-X\beta)^2}{2\sigma^2})]$

# for replication
set.seed(1234)

# predictor
X = rnorm(1000)

# coefficients for intercept and predictor
beta = c(5,2)

# add intercept to X and create y with some noise
y = cbind(1,X)%*%beta + rnorm(1000, sd=2.5)

regLL = function(sigma=1, Int=0, b1=0){
coefs = c(Int, b1)
mu = cbind(1,X)%*%coefs
ll = -sum(dnorm(y, mean=mu, sd=sigma, log=T))

message(paste(sigma, Int, b1, ll))
ll
}

library(bbmle)
mlopt =  mle2(regLL, method="L-BFGS-B", lower=c(sigma=0))
summary(mlopt)
Maximum likelihood estimation

Call:
mle2(minuslogl = regLL, method = "L-BFGS-B", lower = c(sigma = 0))

Coefficients:
Estimate Std. Error z value     Pr(z)
sigma 2.447823   0.054735  44.721 < 2.2e-16 ***
Int   5.039976   0.077435  65.087 < 2.2e-16 ***
b1    2.139284   0.077652  27.549 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

-2 log L: 4628.273 
# plot(profile(mlopt), absVal=F)

modlm = lm(y~X)
summary(modlm)

Call:
lm(formula = y ~ X)

Residuals:
Min      1Q  Median      3Q     Max
-7.9152 -1.6097  0.0363  1.6343  7.6711

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.03998    0.07751   65.02   <2e-16 ***
X            2.13928    0.07773   27.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.45 on 998 degrees of freedom
Multiple R-squared:  0.4315,    Adjusted R-squared:  0.4309
F-statistic: 757.5 on 1 and 998 DF,  p-value: < 2.2e-16
-2*logLik(modlm)
'log Lik.' 4628.273 (df=3)

As before, our estimates and final log likelihood value are about where they should be, and reflect the lm output, as the OLS estimates are the maximum likelihood estimates. The visualization becomes more difficult beyond two parameters, but we can examine slices similar to the previous plot.

To move to generalized linear models, very little changes of the process outside of the distribution assumed and that we are typically modeling a function of the target variable (e.g. $$\log(y)=X\beta; \mu = e^{X\beta}$$).

## Binomial Likelihood Example

This regards the example seen in the early part of the document with the hands-on example.

x1 = rbinom(1000, size=10, p=.5)
x2 = rbinom(1000, size=10, p=.85)

binomLL = function(theta, x) {
-sum(dbinom(x, size=10, p=theta, log=T))
}

optimize(binomLL, x=x1, lower=0, upper=1); mean(x1)
$minimum [1] 0.5043001$objective
[1] 1902.557
[1] 5.043
optimize(binomLL, x=x2, lower=0, upper=1); mean(x2)
$minimum [1] 0.8568963$objective
[1] 1438.786
[1] 8.569

## Modeling Languages

I will talk only briefly about a couple of the modeling language options available, as you will have to make your own choice among many.

### Bugs

When I first wrote this document, BUGS (Lunn et al. 2012) (Bayesian inference Using Gibbs Sampling) was perhaps the most widely known and used Bayesian modeling language, and it has been around for almost 30 years at this point. It is implemented via OpenBUGS and freely available for download65. It even has a GUI interface if such a thing is desired.

### JAGS

JAGS (Just Another Gibbs Sampler) is a more recent dialect of the BUGS language, and is also free to use. It offers some technical and modeling advantages to OpenBUGs, but much of the code translates directly from one to the other.

### Stan

Stan is a relative newcomer to Bayesian modeling languages, and as I write this it is about to have the 5th year anniversary of its 1.0 release. It uses a different estimation procedure than the BUGS language and this makes it more flexible and perhaps better behaved for many types of models. It actually compiles Stan code to C++, and so can be very fast as well. I personally prefer it as I find it more clear in its expression, and there are many associated tools for model exploration.

### R

R has many modeling packages devoted to Bayesian analysis such that there is a Task View specific to the topic. Most of them are even specific to the implementation of a certain type of analysis66. So not only can you do everything within R and take advantage of the power of those languages, you can then use Bayesian specific R packages on the results. For standard and even some complex models I would suggest using the rstanarm or brms packages as a way to stick to the usual R modeling style, unless you have a notably complicated model, at which point you can use rstan.

### General Statistical Packages

The general statistical languages such as SAS, SPSS, and Stata were generally very late to the Bayesian game, even for implementations of Bayesian versions of commonly used models. SAS started a few years ago with experimental and extremely limited capability, and Stata only very recently (but there is StataStan). SPSS doesn’t seem to have much capability in this area, much like a lot of other things. Others still seem to be lacking as well. In general, I wouldn’t recommend these packages for Bayesian analysis, except as an interface to one of the Bayesian specific languages, assuming they have the capability and you are already familiar with the program.

### Other Programming Languages

Python has functionality via modules such as PyMC, and Stan has a Python implementation, PyStan. Julia and Matlab both have Stan ports as well. And with any programming language that you might use for statistical analysis, you could certainly do a lot of it by hand if you have the time, though you should exhaust tested implementations first.

### Summary

In short, you have plenty of options. I would suggest starting with a Bayesian programming language or using that language within your chosen statistical environment or package. This gives you the most modeling flexibility, choice, and opportunity to learn.

## BUGS Example

The following provides a BUGS example67 of the primary model used in this document. The applicable code for the data set up is in the Example: Linear Regression Model section of the document. The model matrix X must be a matrix class object. Next we setup a BUGS data list as we did with Stan, and create a text file that contains the model code. Note that the data list comprises simple characters which are used to look for objects of those names that are in the environment. Also, I use cat with sink so that I don’t have to go to a separate text editor to create the file.

One of the big differences between BUGS and other languages is its use of the precision parameter $$\frac{1}{\sigma^2}$$, the inverse variance, usually denoted as $$\tau$$. While there were some computational niceties to be had in doing so, even the authors admit this was not a good decision in retrospect. Prepare to have that issue come up from time to time when you inevitably forget. Comments and assignments are the same as R, and distributions noted with $$\sim$$.

bugsdat = list('y', 'X', 'N', 'K')

# This will create a file, lmbugs.txt that will subsequently be called
sink('data/lmbugs.txt')
cat(
'model {
for (n in 1:N){
mu[n] <- beta[1]*X[n,1] + beta[2]*X[n,2] + beta[3]*X[n,3] + beta[4]*X[n,4]
y[n] ~ dnorm(mu[n], inv.sigma.sq)
}
for (k in 1:K){
beta[k] ~ dnorm(0, .001) # prior for reg coefs
}
# Half-cauchy as in Gelman 2006
# Scale parameter is 5, so precision of z = 1/5^2 = 0.04
sigma.y <- abs(z)/sqrt(chSq) # prior for sigma; cauchy = normal/sqrt(chi^2)
z ~ dnorm(0, .04)I(0,)
chSq ~ dgamma(0.5, 0.5) # chi^2 with 1 d.f.
inv.sigma.sq <- pow(sigma.y, -2) # precision
# sigma.y ~ dgamma(.001, .001) # prior for sigma; a typical approach used.
}'
)
sink()

# explicitly provided initial values not necessary, but one can specify them
# as follows, and you may have problems with variance parameters if you don't.
# Note also that sigma.y is unnecesary if using the half-cauchy approach as
# it is defined based on other values.

# inits <- list(list(beta=rep(0,4), sigma.y=runif(1,0,10)),
#               list(beta=rep(0,4), sigma.y=runif(1,0,10)),
#               list(beta=rep(0,4), sigma.y=runif(1,0,10)))
# parameters <- c('beta', 'sigma.y')

Now we are ready to run the model. You’ll want to examine the help file for the bugs function for more information. In addition, depending on your setup you may need to set the working directory and other options. Note that n.thin argument is used differently than other packages. One specifies the n posterior draws (per chain) you to keep want as n.iter-n.burnin. The thinned samples aren’t stored. Compare this to other packages where n.iter is the total before thinning and including burn-in, and n.keep is (n.iter-n.burnin)/n.thin. With the function used here, n.keep is the same, but as far as arguments your you’ll want to think of n.iter as the number of posterior draws after thinning. So the following all produce 1000 posterior draws in R2OpenBUGS:

 n.iter=3000 n.thin=1 n.burnin=2000 n.iter=3000 n.thin=10 n.burnin=2000 n.iter=3000 n.thin=100 n.burnin=2000

In other packages, with those arguments you’d end up with 1000, 100, and 10 posterior draws.

lmbugs <- bugs(bugsdat, inits=NULL, parameters=c('beta', 'sigma.y'),
model.file='lmbugs.txt', n.chains=3, n.iter=3000, n.thin=10,
n.burnin=2000)

Now we are ready for the results, which will be the same as what we saw with Stan. In addition to the usual output, you get the deviance information criterion as a potential means for model comparison.

## lmbugs$summary mean sd 2.5% 50% 97.5% Rhat n.eff beta[1] 4.90 0.13 4.65 4.90 5.14 1 2400 beta[2] 0.08 0.13 -0.17 0.08 0.34 1 3000 beta[3] -1.47 0.13 -1.72 -1.47 -1.22 1 2100 beta[4] 0.82 0.12 0.59 0.83 1.05 1 3000 sigma.y 2.03 0.09 1.86 2.02 2.22 1 3000 deviance 1063.61 3.15 1059.00 1063.00 1071.00 1 3000 The usual model diagnostics are available with conversion of the results to an object the coda package can work with. Figures are not shown, but they are the typical traceplots and density plots. lmbugscoda = as.mcmc.list(lmbugs) traceplot(lmbugscoda) densityplot(lmbugscoda) plot(lmbugscoda) corrplot:::corrplot(cor(lmbugscoda[[2]])) # noticeably better than levelplot ## JAGS Example The following shows how to run the regression model presented earlier in the document via JAGS. Once you have the data set up as before, the data list is done in the same fashion as with BUGS. The code itself is mostly identical, save for the use of T instead of I for truncation. JAGS, being a BUGS dialect, also uses the precision parameter in lieu of the variance, which isn’t annoying at all. jagsdat = list('y'=y, 'X'=X, 'N'=N, 'K'=K) sink('data/lmjags.txt') cat( 'model { for (n in 1:N){ mu[n] <- beta[1] * X[n, 1] + beta[2] * X[n, 2] + beta[3] * X[n, 3] + beta[4] * X[n, 4] y[n] ~ dnorm(mu[n], inv.sigma.sq) } for (k in 1:K){ beta[k] ~ dnorm(0, .001) } # Half-cauchy as in Gelman 2006 # Scale parameter is 5, so precision of z = 1/5^2 = 0.04 sigma.y <- z/sqrt(chSq) z ~ dnorm(0, .04)T(0,) chSq ~ dgamma(0.5, 0.5) inv.sigma.sq <- pow(sigma.y, -2) }' ) sink() # explicitly provided initial values not necessary, but can specify as follows # inits <- function(){ # list(beta=rep(0,4), sigma.y=runif(1,0,10)) # } parameters = c('beta', 'sigma.y') With everything set, we can now run the model. With JAGS, we have what might be called an initialization stage that sets the model up and runs through the warm-up period, after which we can then flexibly sample from the posterior via the coda.samples function. library(rjags); library(coda) lmjagsmod = jags.model(file='data/lmjags.txt', data=jagsdat, # inits=inits n.chains=3, n.adapt=2000) lmjags = coda.samples(lmjagsmod, c('beta', 'sigma.y'), n.iter=10000, thin=10, n.chains=3) Now we have a model identical to the others, and can summarize the posterior distribution in similar fashion. summary(lmjags) coda::effectiveSize(lmjags) term estimate std.error conf.low conf.high beta[1] 4.89 0.13 4.64 5.15 beta[2] 0.08 0.13 -0.17 0.34 beta[3] -1.47 0.13 -1.71 -1.22 beta[4] 0.81 0.12 0.57 1.06 sigma.y 2.03 0.09 1.86 2.22 x beta[1] 3000 beta[2] 3453 beta[3] 3000 beta[4] 3000 sigma.y 3000 ## Metropolis Hastings Example Next depicted is a random walk Metropolis-Hastings algorithm using the data and model from prior sections of the document. I had several texts open while cobbling together this code such as Gelman et al. (2013), and some oriented towards the social sciences by Gill (2008), Jackman (2009), and Lynch (2007) etc. Some parts of the code reflect information and code examples found therein, and follows Lynch’s code a bit more. The primary functions that we need to specify regard the posterior distribution68, an update step for beta coefficients, and an update step for the variance estimate. # posterior function post = function(x, y, b, s2) { # Args: X is the model matrix; y the target vector; b and s2 the parameters # to be estimated beta = b sigma = sqrt(s2) sigma2 = s2 mu = X %*% beta # priors are b0 ~ N(0, sd=10), sigma2 ~ invGamma(.001, .001) priorbvarinv = diag(1/100, 4) prioralpha = priorbeta = .001 if (is.nan(sigma) | sigma<=0) { # scale parameter must be positive return(-Inf) } # Note that you will not find the exact same presentation across texts and # other media for the log posterior in this conjugate setting. In the end # they are conceptually still (log) prior + (log) likelihood (See commented 'else') else { -.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y-mu))) + -.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta)%*%priorbvarinv%*%beta)) + -(prioralpha+1)*log(sigma2) + log(sigma2) - priorbeta/sigma2 } # else { # ll = mvtnorm::dmvnorm(y, mean=mu, sigma=diag(sigma2, length(y)), log=T) # priorb = mvtnorm::dmvnorm(beta, mean=rep(0, length(beta)), sigma=diag(100, length(beta)), log=T) # priors2 = dgamma(1/sigma2, prioralpha, priorbeta, log=T) # logposterior = ll + priorb + priors2 # logposterior # } } # update step for regression coefficients updatereg = function(i, x, y, b, s2) { # Args are the same as above but with additional i iterator argument. b[i,] = MASS::mvrnorm(1, mu=b[i-1,], Sigma=bvarscale) # proposal/jumping distribution # Compare to past- does it increase the posterior probability? postdiff = post(x=x, y=y, b=b[i,], s2=s2[i-1]) - post(x=x, y=y, b=b[i-1,], s2=s2[i-1]) # Acceptance phase unidraw = runif(1) accept = unidraw < min(exp(postdiff), 1) # accept if so if (accept) b[i,] else b[i-1,] } # update step for sigma2 updates2 = function(i, x, y, b, s2) { s2candidate = rnorm(1, s2[i-1], sd=sigmascale) if (s2candidate < 0) { accept = FALSE } else { s2diff = post(x=x, y=y, b=b[i,], s2=s2candidate) - post(x=x, y=y, b=b[i,], s2=s2[i-1]) unidraw = runif(1) accept = unidraw < min(exp(s2diff), 1) } ifelse(accept, s2candidate, s2[i-1]) } Now we can set things up for the MCMC chain69. Aside from the typical MCMC setup and initializing the parameter matrices to hold the draws from the posterior, we also require scale parameters to use for the jumping/proposal distribution. # Setup, starting values etc. nsim = 5000 burnin = 1000 thin = 10 b = matrix(0, nsim, ncol(X)) # initialize beta update matrix s2 = rep(1, nsim) # initialize sigma vector # For the following this c term comes from BDA3 12.2 and will produce an # acceptance rate of .44 in 1 dimension and declining from there to about # .23 in high dimensions. For the sigmascale, the magic number comes from # starting with a value of one and fiddling from there to get around .44. c = 2.4/sqrt(ncol(b)) bvar = vcov(lm(y~., data.frame(X[,-1]))) bvarscale = bvar * c^2 sigmascale = .9 We can now run and summarize the model with tools from the coda package. # Run for (i in 2:nsim) { b[i,] = updatereg(i=i, y=y, x=X, b=b, s2=s2) s2[i] = updates2(i=i, y=y, x=X, b=b, s2=s2) } # calculate acceptance rates baccrate = mean(diff(b[(burnin+1):nsim,]) != 0) s2accrate = mean(diff(s2[(burnin+1):nsim]) != 0) baccrate s2accrate # get final chain library(coda) bfinal = as.mcmc(b[seq(burnin+1, nsim, by=thin),]) s2final = as.mcmc(s2[seq(burnin+1, nsim, by=thin)]) # get summaries; compare to lm and stan summary(bfinal); summary(s2final) round(c(coef(modlm), summary(modlm)$sigma^2), 3)
[1] 0.2943236
[1] 0.4243561
mean se_mean sd X2.5. X97.5. n_eff Rhat
beta.1 4.90 0.01 0.13 4.64 5.15 308.18 1
beta.2 0.09 0.01 0.14 -0.20 0.36 245.79 1
beta.3 -1.46 0.01 0.13 -1.70 -1.20 341.54 1
beta.4 0.82 0.01 0.12 0.59 1.06 360.59 1
sigmasq 4.08 0.01 0.37 3.43 4.87 821.89 1

Here is the previous Stan fit for comparison.

term estimate std.error conf.low conf.high
beta[1] 4.90 0.14 4.64 5.16
beta[2] 0.09 0.13 -0.16 0.36
beta[3] -1.47 0.13 -1.71 -1.22
beta[4] 0.82 0.12 0.57 1.05
sigma 2.03 0.09 1.85 2.22

## Hamiltonian Monte Carlo Example

The following demonstrates Hamiltonian Monte Carlo, the technique that Stan uses, and which is a different estimation approach than the Gibbs sampler in BUGS/JAGS. If you are interested in the details enough to be reading this, I highly recommend Betancourt’s conceptual introduction to HMC. This example still assumes the data we used in this document, and is largely based on the code in the appendix of Gelman et al. (2013).

# Log posterior function
log_p_th = function(X, y, th) {
# Args: X is the model matrix; y the target vector; th is the current
# parameter estimates.
beta = th[-length(th)]            # reg coefs to be estimated
sigma = th[length(th)]            # sigma to be estimated
sigma2 = sigma^2
mu = X %*% beta

# priors are b0 ~ N(0, sd=10), sigma2 ~ invGamma(.001, .001)
priorbvarinv = diag(1/100, 4)
prioralpha = priorbeta = .001

if (is.nan(sigma) | sigma<=0) {     # scale parameter must be positive, so post
return(-Inf)                    # density is zero if it jumps below zero
}
# log posterior in this conjugate setting. conceptuallyit's
# (log) prior + (log) likelihood. (See commented 'else')
else {
-.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y-mu))) +
-.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta)%*%priorbvarinv%*%beta)) +
-(prioralpha+1)*log(sigma2) + log(sigma2) - priorbeta/sigma2
}
# else {
#   ll = mvtnorm::dmvnorm(y, mean=mu, sigma=diag(sigma2, length(y)), log=T)
#   priorb = mvtnorm::dmvnorm(beta, mean=rep(0, length(beta)), sigma=diag(100, length(beta)), log=T)
#   priors2 = dgamma(1/sigma2, prioralpha, priorbeta, log=T)
#   logposterior = ll + priorb + priors2
#   logposterior
# }
}

# numerical gradient function as given in BDA3 p. 602; same args as posterior
gradient_th_numerical = function(X, y, th) {
d = length(th)
e = .0001
diffs = numeric(5)
for (k in 1:d) {
th_hi = th
th_lo = th
th_hi[k] = th[k] + e
th_lo[k] = th[k] - e
diffs[k] = (log_p_th(X, y, th_hi) - log_p_th(X, y, th_lo)) / (2*e)
}
return(diffs)
}

# single HMC iteration function
hmc_iteration = function(X, y, th, epsilon, L, M) {
# Args: epsilon is the stepsize; L is the number of leapfrog steps; epsilon
# and L are drawn randomly at each iteration to explore other areas of the
# posterior (starting with epsilon0 and L0); M is a diagonal mass matrix
# (expressed as a vector), a bit of a magic number in this setting. It regards
# the mass of a particle whose position is represented by theta, and momentum
# by phi. See the sampling section of chapter 1 in the Stan manual for more
# detail.

M_inv = 1/M
d = length(th)
phi = rnorm(d, 0, sqrt(M))
th_old = th
log_p_old = log_p_th(X, y, th) - .5*sum(M_inv*phi^2)
phi = phi + .5*epsilon*gradient_th_numerical(X, y, th)

for (l in 1:L) {
th = th + epsilon*M_inv*phi
phi = phi + ifelse(l==L, .5, 1) * epsilon*gradient_th_numerical(X, y, th)
}

# here we get into standard MCMC stuff, jump or not based on a draw from a
# proposal distribution
phi = -phi
log_p_star = log_p_th(X, y, th) - .5*sum(M_inv*phi^2)
r = exp(log_p_star - log_p_old)
if (is.nan(r)) r = 0
p_jump = min(r, 1)
if (runif(1) < p_jump) {
th_new = th
}
else {
th_new = th_old
}
return(list(th=th_new, p_jump=p_jump))  # returns estimates and acceptance rate
}

# main HMC function
hmc_run = function(starts, iter, warmup, epsilon_0, L_0, M, X, y) {
# Args: starts are starting values; iter is total number of simulations for
# each chain (note chain is based on the dimension of starts); warmup
# determines which of the initial iterations will be ignored for inference
# purposes; edepsilon0 is the baseline stepsize; L0 is the baseline number
# of leapfrog steps; M is the mass vector
chains = nrow(starts)
d = ncol(starts)
sims = array(NA, c(iter, chains, d),
dimnames=list(NULL, NULL, colnames(starts)))
p_jump = matrix(NA, iter, chains)

for (j in 1:chains) {
th = starts[j,]
for (t in 1:iter) {
epsilon = runif(1, 0, 2*epsilon_0)
L = ceiling(2*L_0*runif(1))
temp = hmc_iteration(X, y, th, epsilon, L, M)
p_jump[t,j] = temp$p_jump sims[t,j,] = temp$th
th = temp$th } } # rstan::monitor(sims, warmup, digits_summary=3) acc = round(colMeans(p_jump[(warmup+1):iter,]), 3) # acceptance rate message('Avg acceptance probability for each chain: ', paste0(acc[1],', ',acc[2]), '\n') return(list(sims=sims, p_jump=p_jump)) } With the primary functions in place, we set the starting values and choose other settings for for the HMC process. The coefficient starting values are based on random draws from a uniform distribution, while $$\sigma$$ is set to a value of one in each case. As in the other examples we’ll have 5000 total draws with warm-up set to 2500. I don’t have any thinning option here, but that could be added or simply done as part of the coda package preparation. # Starting values and mcmc settings parnames = c(paste0('beta[',1:4,']'), 'sigma') d = length(parnames) chains = 2 thetastart = t(replicate(chains, c(runif(d-1, -1, 1), 1))) colnames(thetastart) = parnames nsim = 1000 wu = 500 # fiddle with these to get a desirable acceptance rate of around .80. The # following work well with the document data. stepsize = .08 nLeap = 10 vars = rep(1, 5) We are now ready to run the model. On my machine and with the above settings, it took a couple seconds. Once complete we can use the coda package if desired as we have done before. # Run the model M1 = hmc_run(starts=thetastart, iter=nsim, warmup=wu, epsilon_0=stepsize, L_0=nLeap, M=mass_vector, X=X, y=y) # str(M1, 1) # use coda if desired library(coda) theta = as.mcmc.list(list(as.mcmc(M1$sims[(wu+1):nsim, 1,]),
as.mcmc(M1$sims[(wu+1):nsim, 2,]))) summary(theta) finalest = summary(theta)$statistics[,'Mean']
b = finalest[1:4]
sig = finalest[5]
log_p_th(X, y, finalest)
Avg acceptance probability for each chain: 0.803, 0.794
mean se_mean sd 2.5% 97.5% n_eff Rhat
beta[1] 4.90 0.00 0.12 4.66 5.12 740.49 1
beta[2] 0.09 0.01 0.13 -0.17 0.34 527.68 1
beta[3] -1.47 0.00 0.12 -1.70 -1.21 666.45 1
beta[4] 0.83 0.01 0.12 0.58 1.05 514.07 1
sigma 2.01 0.00 0.09 1.84 2.22 485.69 1
log posterior = -301.718

Our estimates look pretty good, and inspection of the diagnostics would show good mixing and convergence as well. At this point we can compare it to the Stan output. For the following, I modified the previous Stan code to use the same inverse gamma prior and tweaked the control options for a little bit more similarity, but that’s not necessary.

term estimate std.error conf.low conf.high
beta[1] 4.89 0.13 4.64 5.14
beta[2] 0.08 0.13 -0.17 0.34
beta[3] -1.47 0.13 -1.72 -1.22
beta[4] 0.82 0.12 0.58 1.06
sigma 2.03 0.09 1.86 2.22
log posterior = -301.532

### References

Lunn, David, Chris Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. Boca Raton, FL: Chapman; Hall/CRC.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed.

Gill, Jeff. 2008. Bayesian Methods : A Social and Behavioral Sciences Approach. Second. Boca Raton: Chapman & Hall/CRC.

Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Chichester, UK: Wiley.

Lynch, Scott M. 2007. Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. New York: Springer.

1. Math refresher on logs: log(A*B) = log(A)+log(B). So summing the log probabilities will result in the same values for $$\theta$$, but won’t result in extremely small values that will break our computer.

2. Of course, we could just use the sample estimates, but this is for demonstration.

3. Type ?Distributions at the console for some of the basic R distributions available.

4. Much more straightforward than writing the likelihood function as above.

5. An alternative approach would be to work with the log of $$\sigma$$ which can take on negative values, and then convert it back to the original scale.

6. Actually, there is a difference between the sigma estimates in that OLS estimates are based on a variance estimate divided by $$N-p$$ while the MLE estimate has a divisor of $$N$$.

7. Which is equivalent to finding the point where the slope of the tangent line to some function, i.e. the derivative, to the surface is zero. The derivative, or gradient in the case of multiple parameters, of the likelihood function with respect to the parameters is known as the score function.

8. You might come across a previous incarnation, WinBugs, but it is no longer being developed.

9. Many of these packages, if not all of them will be less flexible in model specification compared to implementing the aforementioned languages directly, or using the R interface to those languages. What’s more, R has interfaces to the previous language engines via the packages R2OpenBUGS and BRugs, rjags, and rstan, rstanarm brms, and many others.

10. A word of warning, I haven’t even had R2OpenBUGS on my machine in years, so this code for post-processing the model may be a bit iffy depending on how the package works these days. The model code, which is the important part, is fine though. The JAGS section has been recently tested however.

11. Assuming normal for $$\beta$$ coefficients, inverse gamma on $$\sigma^2$$.

12. This code regards only one chain, though a simple loop or any number of other approaches would easily extend it to two or more.