# Appendix

## Maximum Likelihood Review

This is a very brief refresher on maximum likelihood estimation using a standard regression approach as an example, and more or less assumes one hasn’t tried to roll their own such function in a programming environment before. Given the likelihood’s role in Bayesian estimation and statistics in general, and the ties between specific Bayesian results and maximum likelihood estimates one typically comes across, one should be conceptually comfortable with some basic likelihood estimation.

In the standard model setting we attempt to find parameters \(\theta\) that will maximize the probability of the data we actually observe. We’ll start with an observed random target vector \(y\) with \(i...N\) independent and identically distributed observations and some data-generating process underlying it \(f(\cdot|\theta)\). We are interested in estimating the model parameter(s), \(\theta\), that would make the data most likely to have occurred. The probability density function for \(y\) given some particular estimate for the parameters can be noted as \(f(y_i|\theta)\). The joint probability distribution of the (independent) observations given those parameters, \(f(y_i|\theta)\), is the product of the individual densities, and is our *likelihood function*. We can write it out generally as:

\[\mathcal{L}(\theta) = \prod_{i=1}^N f(y_i|\theta)\]

Thus, the *likelihood* for one set of parameter estimates given a fixed set of data y, is equal to the probability of the data given those (fixed) estimates. Furthermore, we can compare one set, \(\mathcal{L}(\theta_A)\), to that of another, \(\mathcal{L}(\theta_B)\), and whichever produces the greater likelihood would be the preferred set of estimates. We can get a sense of this with the following visualization, based on a single parameter, Poisson distributed variable. The data is drawn from a variable with mean \(\theta=5\). We note the calculated likelihood increases as we estimate values for \(\theta\) closer to \(5\). With more and more data, the final ML estimate will converge on the true value.

`Final estimate = 5.02`

For computational reasons, we instead work with the sum of the natural log probabilities^{58}, and thus the *log likelihood*:

\[\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[f(y_i|\theta)]\]

Concretely, we calculate a log likelihood for each observation and then sum them for the total likelihood for parameter(s) \(\theta\).

The likelihood function incorporates our assumption about the sampling distribution of the data given some estimate for the parameters. It can take on many forms and be notably complex depending on the model in question, but once specified, we can use any number of optimization approaches to find the estimates of the parameter that make the data most likely. As an example, for a normally distributed variable of interest we can write the log likelihood as follows:

\[\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-\mu)^2}{2\sigma^2})]\]

### Example

In the following we will demonstrate the maximum likelihood approach to estimation for a simple setting incorporating a normal distribution, where we estimate the mean and variance/sd for a set of values \(y\)^{59}. First the data is created, and then we create the function that will compute the log likelihood. Using the built in R distributions^{60} makes it fairly straightforward to create our own likelihood function and feed it into an optimization function to find the best parameters. We will set things up to work with the bbmle package, which has some nice summary functionality and other features. However, one should take a glance at optim and the other underlying functions that do the work.

```
# for replication
set.seed(1234)
# create the data
y = rnorm(1000, mean=5, sd=2)
startvals = c(0, 1)
# the log likelihood function
LL = function(mu=startvals[1], sigma=startvals[2], verbose=TRUE) {
ll = -sum(dnorm(y, mean=mu, sd=sigma, log=T))
if (verbose) message(paste(mu, sigma, ll))
ll
}
```

The LL function takes starting points for the parameters as arguments, in this case we call them \(\mu\) and \(\sigma\), which will be set to 0 and 1 respectively. Only the first line (ll = -sum…) is actually necessary, and we use dnorm to get the density for each point^{61}. Since this optimizer is by default minimization, we reverse the sign of the sum so as to minimize the negative log likelihood, which is the same as maximizing the likelihood. Note that the bit of other code just allows you to see the estimates as the optimization procedure searches for the best values. I do not show that here but you’ll see it in your console if `verbose=TRUE`

.

We are now ready to obtain maximum likelihood estimates for the parameters. For the mle2 function we will need the function we’ve created, plus other inputs related to that function or the underlying optimizing function used (by default optim). In this case we will use an optimization procedure that will allow us to set a lower bound for \(\sigma\). This isn’t strictly necessary, but otherwise you would get warnings and possibly lack of convergence if negative estimates for \(\sigma\) were allowed^{62}.

```
library(bbmle)
# using optim, and L-BFGS-B so as to constrain sigma to be positive by setting
# the lower bound at zero
mlnorm = mle2(LL, start=list(mu=2, sigma=1), method="L-BFGS-B", lower=c(sigma=0))
mlnorm
```

```
Call:
mle2(minuslogl = LL, start = list(mu = 2, sigma = 1), method = "L-BFGS-B",
lower = c(sigma = 0))
Coefficients:
mu sigma
4.946803 1.993680
Log-likelihood: -2108.92
```

```
Call:
lm(formula = y ~ 1)
Residuals:
Min 1Q Median 3Q Max
-6.7389 -1.2933 -0.0264 1.2848 6.4450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.94681 0.06308 78.42 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.995 on 999 degrees of freedom
```

We can see that the ML estimates are the same as the intercept only model estimates^{63}, which given the sample size are close to the true values.

In terms of the parameters we estimate, in the typical case of two or more parameters we can think of a likelihood surface that represents the possible likelihood values given any particular set of estimates. Given some starting point, the optimization procedure then travels along the surface looking for a minimum/maximum point^{64}. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. The optimizer travels along this surface until it finds a minimum (the surface plot is interactive- feel free to adjust). I also plot the path of the optimizer from a top down view. The large blue dot noted represents the minimum negative log likelihood.

A bit of jitter was added to the points to better see what’s going on.

Please note that there are many other considerations in optimization completely ignored here, but for our purposes and the audience for which this is intended, we do not want to lose sight of the forest for the trees. We now move next to a slightly more complicated regression example.

## Linear Model

In the standard regression context, our expected value for the target variable comes from our linear predictor, i.e. the weighted combination of our explanatory variables, and we estimate the regression weights/coefficients and possibly other relevant parameters. We can expand our previous example to the standard linear model without too much change. In this case we estimate a mean for each observation, but otherwise assume the variance is constant across observations. Again, we first construct some data so that we know exactly what to expect, then write out the likelihood function with starting parameters. As we need to estimate our intercept and coefficient for the X predictor (collectively referred to as \(\beta\)), we can think of our likelihood explicitly as before:

\[\ln\mathcal{L}(\beta, \sigma^2) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-X\beta)^2}{2\sigma^2})]\]

```
# for replication
set.seed(1234)
# predictor
X = rnorm(1000)
# coefficients for intercept and predictor
beta = c(5,2)
# add intercept to X and create y with some noise
y = cbind(1,X)%*%beta + rnorm(1000, sd=2.5)
regLL = function(sigma=1, Int=0, b1=0){
coefs = c(Int, b1)
mu = cbind(1,X)%*%coefs
ll = -sum(dnorm(y, mean=mu, sd=sigma, log=T))
message(paste(sigma, Int, b1, ll))
ll
}
library(bbmle)
mlopt = mle2(regLL, method="L-BFGS-B", lower=c(sigma=0))
summary(mlopt)
```

```
Maximum likelihood estimation
Call:
mle2(minuslogl = regLL, method = "L-BFGS-B", lower = c(sigma = 0))
Coefficients:
Estimate Std. Error z value Pr(z)
sigma 2.447823 0.054735 44.721 < 2.2e-16 ***
Int 5.039976 0.077435 65.087 < 2.2e-16 ***
b1 2.139284 0.077652 27.549 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-2 log L: 4628.273
```

```
Call:
lm(formula = y ~ X)
Residuals:
Min 1Q Median 3Q Max
-7.9152 -1.6097 0.0363 1.6343 7.6711
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.03998 0.07751 65.02 <2e-16 ***
X 2.13928 0.07773 27.52 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.45 on 998 degrees of freedom
Multiple R-squared: 0.4315, Adjusted R-squared: 0.4309
F-statistic: 757.5 on 1 and 998 DF, p-value: < 2.2e-16
```

`'log Lik.' 4628.273 (df=3)`

As before, our estimates and final log likelihood value are about where they should be, and reflect the lm output, as the OLS estimates are the maximum likelihood estimates. The visualization becomes more difficult beyond two parameters, but we can examine slices similar to the previous plot.

To move to generalized linear models, very little changes of the process outside of the distribution assumed and that we are typically modeling a function of the target variable (e.g. \(\log(y)=X\beta; \mu = e^{X\beta}\)).

## Binomial Likelihood Example

This regards the example seen in the early part of the document with the hands-on example.

```
x1 = rbinom(1000, size=10, p=.5)
x2 = rbinom(1000, size=10, p=.85)
binomLL = function(theta, x) {
-sum(dbinom(x, size=10, p=theta, log=T))
}
optimize(binomLL, x=x1, lower=0, upper=1); mean(x1)
```

```
$minimum
[1] 0.5043001
$objective
[1] 1902.557
```

`[1] 5.043`

```
$minimum
[1] 0.8568963
$objective
[1] 1438.786
```

`[1] 8.569`

## Modeling Languages

I will talk only briefly about a couple of the modeling language options available, as you will have to make your own choice among many.

### Bugs

When I first wrote this document, BUGS (Lunn et al. 2012) (*B*ayesian inference *U*sing *G*ibbs *S*ampling) was perhaps the most widely known and used Bayesian modeling language, and it has been around for almost 30 years at this point. It is implemented via OpenBUGS and freely available for download^{65}. It even has a GUI interface if such a thing is desired.

### JAGS

JAGS (*J*ust *A*nother *G*ibbs *S*ampler) is a more recent dialect of the BUGS language, and is also free to use. It offers some technical and modeling advantages to OpenBUGs, but much of the code translates directly from one to the other.

### Stan

Stan is a relative newcomer to Bayesian modeling languages, and as I write this it is about to have the 5th year anniversary of its 1.0 release. It uses a different estimation procedure than the BUGS language and this makes it more flexible and perhaps better behaved for many types of models. It actually compiles Stan code to C++, and so can be very fast as well. I personally prefer it as I find it more clear in its expression, and there are many associated tools for model exploration.

### R

R has many modeling packages devoted to Bayesian analysis such that there is a Task View specific to the topic. Most of them are even specific to the implementation of a certain type of analysis^{66}. So not only can you do everything within R and take advantage of the power of those languages, you can then use Bayesian specific R packages on the results. For standard and even some complex models I would suggest using the rstanarm or brms packages as a way to stick to the usual R modeling style, unless you have a notably complicated model, at which point you can use rstan.

### General Statistical Packages

The general statistical languages such as SAS, SPSS, and Stata were generally *very* late to the Bayesian game, even for implementations of Bayesian versions of commonly used models. SAS started a few years ago with experimental and extremely limited capability, and Stata only very recently (but there is StataStan). SPSS doesn’t seem to have much capability in this area, much like a lot of other things. Others still seem to be lacking as well. In general, I wouldn’t recommend these packages for Bayesian analysis, except as an interface to one of the Bayesian specific languages, assuming they have the capability and you are already familiar with the program.

### Other Programming Languages

Python has functionality via modules such as PyMC, and Stan has a Python implementation, PyStan. Julia and Matlab both have Stan ports as well. And with any programming language that you might use for statistical analysis, you could certainly do a lot of it by hand if you have the time, though you should exhaust tested implementations first.

### Summary

In short, you have plenty of options. I would suggest starting with a Bayesian programming language or using that language within your chosen statistical environment or package. This gives you the most modeling flexibility, choice, and opportunity to learn.

## BUGS Example

The following provides a BUGS example^{67} of the primary model used in this document. The applicable code for the data set up is in the Example: Linear Regression Model section of the document. The model matrix X must be a matrix class object. Next we setup a BUGS data list as we did with Stan, and create a text file that contains the model code. Note that the data list comprises simple characters which are used to look for objects of those names that are in the environment. Also, I use cat with sink so that I don’t have to go to a separate text editor to create the file.

One of the big differences between BUGS and other languages is its use of the precision parameter \(\frac{1}{\sigma^2}\), the inverse variance, usually denoted as \(\tau\). While there were some computational niceties to be had in doing so, even the authors admit this was not a good decision in retrospect. Prepare to have that issue come up from time to time when you inevitably forget. Comments and assignments are the same as R, and distributions noted with \(\sim\).

```
bugsdat = list('y', 'X', 'N', 'K')
# This will create a file, lmbugs.txt that will subsequently be called
sink('data/lmbugs.txt')
cat(
'model {
for (n in 1:N){
mu[n] <- beta[1]*X[n,1] + beta[2]*X[n,2] + beta[3]*X[n,3] + beta[4]*X[n,4]
y[n] ~ dnorm(mu[n], inv.sigma.sq)
}
for (k in 1:K){
beta[k] ~ dnorm(0, .001) # prior for reg coefs
}
# Half-cauchy as in Gelman 2006
# Scale parameter is 5, so precision of z = 1/5^2 = 0.04
sigma.y <- abs(z)/sqrt(chSq) # prior for sigma; cauchy = normal/sqrt(chi^2)
z ~ dnorm(0, .04)I(0,)
chSq ~ dgamma(0.5, 0.5) # chi^2 with 1 d.f.
inv.sigma.sq <- pow(sigma.y, -2) # precision
# sigma.y ~ dgamma(.001, .001) # prior for sigma; a typical approach used.
}'
)
sink()
# explicitly provided initial values not necessary, but one can specify them
# as follows, and you may have problems with variance parameters if you don't.
# Note also that sigma.y is unnecesary if using the half-cauchy approach as
# it is defined based on other values.
# inits <- list(list(beta=rep(0,4), sigma.y=runif(1,0,10)),
# list(beta=rep(0,4), sigma.y=runif(1,0,10)),
# list(beta=rep(0,4), sigma.y=runif(1,0,10)))
# parameters <- c('beta', 'sigma.y')
```

Now we are ready to run the model. You’ll want to examine the help file for the bugs function for more information. In addition, depending on your setup you may need to set the working directory and other options. Note that `n.thin`

argument is used differently than other packages. One specifies the n posterior draws (per chain) you to keep want as n.iter-n.burnin. The thinned samples aren’t stored. Compare this to other packages where n.iter is the total before thinning and including burn-in, and n.keep is (n.iter-n.burnin)/n.thin. With the function used here, n.keep is the same, but as far as arguments your you’ll want to think of n.iter as the number of posterior draws *after* thinning. So the following all produce 1000 posterior draws in R2OpenBUGS:

n.iter=3000 | n.thin=1 | n.burnin=2000 |

n.iter=3000 | n.thin=10 | n.burnin=2000 |

n.iter=3000 | n.thin=100 | n.burnin=2000 |

In other packages, with those arguments you’d end up with 1000, 100, and 10 posterior draws.

```
lmbugs <- bugs(bugsdat, inits=NULL, parameters=c('beta', 'sigma.y'),
model.file='lmbugs.txt', n.chains=3, n.iter=3000, n.thin=10,
n.burnin=2000)
```

Now we are ready for the results, which will be the same as what we saw with Stan. In addition to the usual output, you get the deviance information criterion as a potential means for model comparison.

mean | sd | 2.5% | 50% | 97.5% | Rhat | n.eff | |
---|---|---|---|---|---|---|---|

beta[1] | 4.90 | 0.13 | 4.65 | 4.90 | 5.14 | 1 | 2400 |

beta[2] | 0.08 | 0.13 | -0.17 | 0.08 | 0.34 | 1 | 3000 |

beta[3] | -1.47 | 0.13 | -1.72 | -1.47 | -1.22 | 1 | 2100 |

beta[4] | 0.82 | 0.12 | 0.59 | 0.83 | 1.05 | 1 | 3000 |

sigma.y | 2.03 | 0.09 | 1.86 | 2.02 | 2.22 | 1 | 3000 |

deviance | 1063.61 | 3.15 | 1059.00 | 1063.00 | 1071.00 | 1 | 3000 |

The usual model diagnostics are available with conversion of the results to an object the coda package can work with. Figures are not shown, but they are the typical traceplots and density plots.

## JAGS Example

The following shows how to run the regression model presented earlier in the document via JAGS. Once you have the data set up as before, the data list is done in the same fashion as with BUGS. The code itself is mostly identical, save for the use of T instead of I for truncation. JAGS, being a BUGS dialect, also uses the precision parameter in lieu of the variance, which isn’t annoying at all.

```
jagsdat = list('y'=y, 'X'=X, 'N'=N, 'K'=K)
sink('data/lmjags.txt')
cat(
'model {
for (n in 1:N){
mu[n] <- beta[1] * X[n, 1] + beta[2] * X[n, 2] + beta[3] * X[n, 3] + beta[4] * X[n, 4]
y[n] ~ dnorm(mu[n], inv.sigma.sq)
}
for (k in 1:K){
beta[k] ~ dnorm(0, .001)
}
# Half-cauchy as in Gelman 2006
# Scale parameter is 5, so precision of z = 1/5^2 = 0.04
sigma.y <- z/sqrt(chSq)
z ~ dnorm(0, .04)T(0,)
chSq ~ dgamma(0.5, 0.5)
inv.sigma.sq <- pow(sigma.y, -2)
}'
)
sink()
# explicitly provided initial values not necessary, but can specify as follows
# inits <- function(){
# list(beta=rep(0,4), sigma.y=runif(1,0,10))
# }
parameters = c('beta', 'sigma.y')
```

With everything set, we can now run the model. With JAGS, we have what might be called an initialization stage that sets the model up and runs through the warm-up period, after which we can then flexibly sample from the posterior via the coda.samples function.

```
library(rjags); library(coda)
lmjagsmod = jags.model(file='data/lmjags.txt', data=jagsdat, # inits=inits
n.chains=3, n.adapt=2000)
lmjags = coda.samples(lmjagsmod, c('beta', 'sigma.y'), n.iter=10000,
thin=10, n.chains=3)
```

Now we have a model identical to the others, and can summarize the posterior distribution in similar fashion.

term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|

beta[1] | 4.89 | 0.13 | 4.64 | 5.15 |

beta[2] | 0.08 | 0.13 | -0.17 | 0.34 |

beta[3] | -1.47 | 0.13 | -1.71 | -1.22 |

beta[4] | 0.81 | 0.12 | 0.57 | 1.06 |

sigma.y | 2.03 | 0.09 | 1.86 | 2.22 |

x | |
---|---|

beta[1] | 3000 |

beta[2] | 3453 |

beta[3] | 3000 |

beta[4] | 3000 |

sigma.y | 3000 |

## Metropolis Hastings Example

Next depicted is a random walk Metropolis-Hastings algorithm using the data and model from prior sections of the document. I had several texts open while cobbling together this code such as Gelman et al. (2013), and some oriented towards the social sciences by Gill (2008), Jackman (2009), and Lynch (2007) etc. Some parts of the code reflect information and code examples found therein, and follows Lynch’s code a bit more.

The primary functions that we need to specify regard the posterior distribution^{68}, an update step for beta coefficients, and an update step for the variance estimate.

```
# posterior function
post = function(x, y, b, s2) {
# Args: X is the model matrix; y the target vector; b and s2 the parameters
# to be estimated
beta = b
sigma = sqrt(s2)
sigma2 = s2
mu = X %*% beta
# priors are b0 ~ N(0, sd=10), sigma2 ~ invGamma(.001, .001)
priorbvarinv = diag(1/100, 4)
prioralpha = priorbeta = .001
if (is.nan(sigma) | sigma<=0) { # scale parameter must be positive
return(-Inf)
}
# Note that you will not find the exact same presentation across texts and
# other media for the log posterior in this conjugate setting. In the end
# they are conceptually still (log) prior + (log) likelihood (See commented 'else')
else {
-.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y-mu))) +
-.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta)%*%priorbvarinv%*%beta)) +
-(prioralpha+1)*log(sigma2) + log(sigma2) - priorbeta/sigma2
}
# else {
# ll = mvtnorm::dmvnorm(y, mean=mu, sigma=diag(sigma2, length(y)), log=T)
# priorb = mvtnorm::dmvnorm(beta, mean=rep(0, length(beta)), sigma=diag(100, length(beta)), log=T)
# priors2 = dgamma(1/sigma2, prioralpha, priorbeta, log=T)
# logposterior = ll + priorb + priors2
# logposterior
# }
}
# update step for regression coefficients
updatereg = function(i, x, y, b, s2) {
# Args are the same as above but with additional i iterator argument.
b[i,] = MASS::mvrnorm(1, mu=b[i-1,], Sigma=bvarscale) # proposal/jumping distribution
# Compare to past- does it increase the posterior probability?
postdiff = post(x=x, y=y, b=b[i,], s2=s2[i-1]) -
post(x=x, y=y, b=b[i-1,], s2=s2[i-1])
# Acceptance phase
unidraw = runif(1)
accept = unidraw < min(exp(postdiff), 1) # accept if so
if (accept) b[i,]
else b[i-1,]
}
# update step for sigma2
updates2 = function(i, x, y, b, s2) {
s2candidate = rnorm(1, s2[i-1], sd=sigmascale)
if (s2candidate < 0) {
accept = FALSE
}
else {
s2diff = post(x=x, y=y, b=b[i,], s2=s2candidate) -
post(x=x, y=y, b=b[i,], s2=s2[i-1])
unidraw = runif(1)
accept = unidraw < min(exp(s2diff), 1)
}
ifelse(accept, s2candidate, s2[i-1])
}
```

Now we can set things up for the MCMC chain^{69}. Aside from the typical MCMC setup and initializing the parameter matrices to hold the draws from the posterior, we also require scale parameters to use for the jumping/proposal distribution.

```
# Setup, starting values etc.
nsim = 5000
burnin = 1000
thin = 10
b = matrix(0, nsim, ncol(X)) # initialize beta update matrix
s2 = rep(1, nsim) # initialize sigma vector
# For the following this c term comes from BDA3 12.2 and will produce an
# acceptance rate of .44 in 1 dimension and declining from there to about
# .23 in high dimensions. For the sigmascale, the magic number comes from
# starting with a value of one and fiddling from there to get around .44.
c = 2.4/sqrt(ncol(b))
bvar = vcov(lm(y~., data.frame(X[,-1])))
bvarscale = bvar * c^2
sigmascale = .9
```

We can now run and summarize the model with tools from the coda package.

```
# Run
for (i in 2:nsim) {
b[i,] = updatereg(i=i, y=y, x=X, b=b, s2=s2)
s2[i] = updates2(i=i, y=y, x=X, b=b, s2=s2)
}
# calculate acceptance rates
baccrate = mean(diff(b[(burnin+1):nsim,]) != 0)
s2accrate = mean(diff(s2[(burnin+1):nsim]) != 0)
baccrate
s2accrate
# get final chain
library(coda)
bfinal = as.mcmc(b[seq(burnin+1, nsim, by=thin),])
s2final = as.mcmc(s2[seq(burnin+1, nsim, by=thin)])
# get summaries; compare to lm and stan
summary(bfinal); summary(s2final)
round(c(coef(modlm), summary(modlm)$sigma^2), 3)
```

`[1] 0.2943236`

`[1] 0.4243561`

mean | se_mean | sd | X2.5. | X97.5. | n_eff | Rhat | |
---|---|---|---|---|---|---|---|

beta.1 | 4.90 | 0.01 | 0.13 | 4.64 | 5.15 | 308.18 | 1 |

beta.2 | 0.09 | 0.01 | 0.14 | -0.20 | 0.36 | 245.79 | 1 |

beta.3 | -1.46 | 0.01 | 0.13 | -1.70 | -1.20 | 341.54 | 1 |

beta.4 | 0.82 | 0.01 | 0.12 | 0.59 | 1.06 | 360.59 | 1 |

sigmasq | 4.08 | 0.01 | 0.37 | 3.43 | 4.87 | 821.89 | 1 |

Here is the previous Stan fit for comparison.

term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|

beta[1] | 4.90 | 0.14 | 4.64 | 5.16 |

beta[2] | 0.09 | 0.13 | -0.16 | 0.36 |

beta[3] | -1.47 | 0.13 | -1.71 | -1.22 |

beta[4] | 0.82 | 0.12 | 0.57 | 1.05 |

sigma | 2.03 | 0.09 | 1.85 | 2.22 |

## Hamiltonian Monte Carlo Example

The following demonstrates Hamiltonian Monte Carlo, the technique that Stan uses, and which is a different estimation approach than the Gibbs sampler in BUGS/JAGS. If you are interested in the details enough to be reading this, I highly recommend Betancourt’s conceptual introduction to HMC. This example still assumes the data we used in this document, and is largely based on the code in the appendix of Gelman et al. (2013).

First we start with the functions.

```
# Log posterior function
log_p_th = function(X, y, th) {
# Args: X is the model matrix; y the target vector; th is the current
# parameter estimates.
beta = th[-length(th)] # reg coefs to be estimated
sigma = th[length(th)] # sigma to be estimated
sigma2 = sigma^2
mu = X %*% beta
# priors are b0 ~ N(0, sd=10), sigma2 ~ invGamma(.001, .001)
priorbvarinv = diag(1/100, 4)
prioralpha = priorbeta = .001
if (is.nan(sigma) | sigma<=0) { # scale parameter must be positive, so post
return(-Inf) # density is zero if it jumps below zero
}
# log posterior in this conjugate setting. conceptuallyit's
# (log) prior + (log) likelihood. (See commented 'else')
else {
-.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y-mu))) +
-.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta)%*%priorbvarinv%*%beta)) +
-(prioralpha+1)*log(sigma2) + log(sigma2) - priorbeta/sigma2
}
# else {
# ll = mvtnorm::dmvnorm(y, mean=mu, sigma=diag(sigma2, length(y)), log=T)
# priorb = mvtnorm::dmvnorm(beta, mean=rep(0, length(beta)), sigma=diag(100, length(beta)), log=T)
# priors2 = dgamma(1/sigma2, prioralpha, priorbeta, log=T)
# logposterior = ll + priorb + priors2
# logposterior
# }
}
# numerical gradient function as given in BDA3 p. 602; same args as posterior
gradient_th_numerical = function(X, y, th) {
d = length(th)
e = .0001
diffs = numeric(5)
for (k in 1:d) {
th_hi = th
th_lo = th
th_hi[k] = th[k] + e
th_lo[k] = th[k] - e
diffs[k] = (log_p_th(X, y, th_hi) - log_p_th(X, y, th_lo)) / (2*e)
}
return(diffs)
}
# single HMC iteration function
hmc_iteration = function(X, y, th, epsilon, L, M) {
# Args: epsilon is the stepsize; L is the number of leapfrog steps; epsilon
# and L are drawn randomly at each iteration to explore other areas of the
# posterior (starting with epsilon0 and L0); M is a diagonal mass matrix
# (expressed as a vector), a bit of a magic number in this setting. It regards
# the mass of a particle whose position is represented by theta, and momentum
# by phi. See the sampling section of chapter 1 in the Stan manual for more
# detail.
M_inv = 1/M
d = length(th)
phi = rnorm(d, 0, sqrt(M))
th_old = th
log_p_old = log_p_th(X, y, th) - .5*sum(M_inv*phi^2)
phi = phi + .5*epsilon*gradient_th_numerical(X, y, th)
for (l in 1:L) {
th = th + epsilon*M_inv*phi
phi = phi + ifelse(l==L, .5, 1) * epsilon*gradient_th_numerical(X, y, th)
}
# here we get into standard MCMC stuff, jump or not based on a draw from a
# proposal distribution
phi = -phi
log_p_star = log_p_th(X, y, th) - .5*sum(M_inv*phi^2)
r = exp(log_p_star - log_p_old)
if (is.nan(r)) r = 0
p_jump = min(r, 1)
if (runif(1) < p_jump) {
th_new = th
}
else {
th_new = th_old
}
return(list(th=th_new, p_jump=p_jump)) # returns estimates and acceptance rate
}
# main HMC function
hmc_run = function(starts, iter, warmup, epsilon_0, L_0, M, X, y) {
# Args: starts are starting values; iter is total number of simulations for
# each chain (note chain is based on the dimension of starts); warmup
# determines which of the initial iterations will be ignored for inference
# purposes; edepsilon0 is the baseline stepsize; L0 is the baseline number
# of leapfrog steps; M is the mass vector
chains = nrow(starts)
d = ncol(starts)
sims = array(NA, c(iter, chains, d),
dimnames=list(NULL, NULL, colnames(starts)))
p_jump = matrix(NA, iter, chains)
for (j in 1:chains) {
th = starts[j,]
for (t in 1:iter) {
epsilon = runif(1, 0, 2*epsilon_0)
L = ceiling(2*L_0*runif(1))
temp = hmc_iteration(X, y, th, epsilon, L, M)
p_jump[t,j] = temp$p_jump
sims[t,j,] = temp$th
th = temp$th
}
}
# rstan::monitor(sims, warmup, digits_summary=3)
acc = round(colMeans(p_jump[(warmup+1):iter,]), 3) # acceptance rate
message('Avg acceptance probability for each chain: ',
paste0(acc[1],', ',acc[2]), '\n')
return(list(sims=sims, p_jump=p_jump))
}
```

With the primary functions in place, we set the starting values and choose other settings for for the HMC process. The coefficient starting values are based on random draws from a uniform distribution, while \(\sigma\) is set to a value of one in each case. As in the other examples we’ll have 5000 total draws with warm-up set to 2500. I don’t have any thinning option here, but that could be added or simply done as part of the coda package preparation.

```
# Starting values and mcmc settings
parnames = c(paste0('beta[',1:4,']'), 'sigma')
d = length(parnames)
chains = 2
thetastart = t(replicate(chains, c(runif(d-1, -1, 1), 1)))
colnames(thetastart) = parnames
nsim = 1000
wu = 500
# fiddle with these to get a desirable acceptance rate of around .80. The
# following work well with the document data.
stepsize = .08
nLeap = 10
vars = rep(1, 5)
```

We are now ready to run the model. On my machine and with the above settings, it took a couple seconds. Once complete we can use the coda package if desired as we have done before.

```
# Run the model
M1 = hmc_run(starts=thetastart, iter=nsim, warmup=wu, epsilon_0=stepsize,
L_0=nLeap, M=mass_vector, X=X, y=y)
# str(M1, 1)
# use coda if desired
library(coda)
theta = as.mcmc.list(list(as.mcmc(M1$sims[(wu+1):nsim, 1,]),
as.mcmc(M1$sims[(wu+1):nsim, 2,])))
summary(theta)
finalest = summary(theta)$statistics[,'Mean']
b = finalest[1:4]
sig = finalest[5]
log_p_th(X, y, finalest)
```

`Avg acceptance probability for each chain: 0.803, 0.794`

mean | se_mean | sd | 2.5% | 97.5% | n_eff | Rhat | |
---|---|---|---|---|---|---|---|

beta[1] | 4.90 | 0.00 | 0.12 | 4.66 | 5.12 | 740.49 | 1 |

beta[2] | 0.09 | 0.01 | 0.13 | -0.17 | 0.34 | 527.68 | 1 |

beta[3] | -1.47 | 0.00 | 0.12 | -1.70 | -1.21 | 666.45 | 1 |

beta[4] | 0.83 | 0.01 | 0.12 | 0.58 | 1.05 | 514.07 | 1 |

sigma | 2.01 | 0.00 | 0.09 | 1.84 | 2.22 | 485.69 | 1 |

`log posterior = -301.718`

Our estimates look pretty good, and inspection of the diagnostics would show good mixing and convergence as well. At this point we can compare it to the Stan output. For the following, I modified the previous Stan code to use the same inverse gamma prior and tweaked the control options for a little bit more similarity, but that’s not necessary.

term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|

beta[1] | 4.89 | 0.13 | 4.64 | 5.14 |

beta[2] | 0.08 | 0.13 | -0.17 | 0.34 |

beta[3] | -1.47 | 0.13 | -1.72 | -1.22 |

beta[4] | 0.82 | 0.12 | 0.58 | 1.06 |

sigma | 2.03 | 0.09 | 1.86 | 2.22 |

`log posterior = -301.532`

### References

Lunn, David, Chris Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2012. *The BUGS Book: A Practical Introduction to Bayesian Analysis*. Boca Raton, FL: Chapman; Hall/CRC.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. *Bayesian Data Analysis*. 3rd ed.

Gill, Jeff. 2008. *Bayesian Methods : A Social and Behavioral Sciences Approach*. Second. Boca Raton: Chapman & Hall/CRC.

Jackman, Simon. 2009. *Bayesian Analysis for the Social Sciences*. Chichester, UK: Wiley.

Lynch, Scott M. 2007. *Introduction to Applied Bayesian Statistics and Estimation for Social Scientists*. New York: Springer.

Math refresher on logs:

`log(A*B) = log(A)+log(B)`

. So summing the log probabilities will result in the same values for \(\theta\), but won’t result in extremely small values that will break our computer.↩Of course, we could just use the sample estimates, but this is for demonstration.↩

Type

`?Distributions`

at the console for some of the basic R distributions available.↩Much more straightforward than writing the likelihood function as above.↩

An alternative approach would be to work with the log of \(\sigma\) which can take on negative values, and then convert it back to the original scale.↩

Actually, there is a difference between the sigma estimates in that OLS estimates are based on a variance estimate divided by \(N-p\) while the MLE estimate has a divisor of \(N\).↩

Which is equivalent to finding the point where the slope of the tangent line to some function, i.e. the derivative, to the surface is zero. The derivative, or gradient in the case of multiple parameters, of the likelihood function with respect to the parameters is known as the score function.↩

You might come across a previous incarnation, WinBugs, but it is no longer being developed.↩

Many of these packages, if not all of them will be less flexible in model specification compared to implementing the aforementioned languages directly, or using the R interface to those languages. What’s more, R has interfaces to the previous language engines via the packages R2OpenBUGS and BRugs, rjags, and rstan, rstanarm brms, and many others.↩

A word of warning, I haven’t even had R2OpenBUGS on my machine in years, so this code for post-processing the model may be a bit iffy depending on how the package works these days. The model code, which is the important part, is fine though. The JAGS section has been recently tested however.↩

Assuming normal for \(\beta\) coefficients, inverse gamma on \(\sigma^2\).↩

This code regards only one chain, though a simple loop or any number of other approaches would easily extend it to two or more.↩