Hamiltonian Monte Carlo
The following demonstrates Hamiltonian Monte Carlo (HMC), the technique that Stan uses, which is a different estimation approach from the Gibbs sampling of BUGS/JAGS. If you are interested enough in the details to be reading this, I highly recommend Betancourt’s conceptual introduction to HMC. This example is largely based on the code in the appendix of Gelman et al. (2013).
- Gelman et al. 2013. Bayesian Data Analysis. 3rd ed.
Data Setup
As in the Metropolis-Hastings chapter, we use some simulated data for a standard linear regression.
```r
library(tidyverse)

# set seed for replicability
set.seed(8675309)

# create a N x k matrix of covariates
N = 250
K = 3

covariates = replicate(K, rnorm(n = N))
colnames(covariates) = c('X1', 'X2', 'X3')

# create the model matrix with intercept
X = cbind(Intercept = 1, covariates)

# create a normally distributed variable that is a function of the covariates
coefs = c(5, .2, -1.5, .9)
sigma = 2
mu    = X %*% coefs
y     = rnorm(N, mu, sigma)

# same as
# y = 5 + .2*X1 - 1.5*X2 + .9*X3 + rnorm(N, mean = 0, sd = 2)

# Run lm for later comparison; but go ahead and examine now if desired
fit_lm = lm(y ~ ., data = data.frame(X[, -1]))

# summary(fit_lm)
```
Functions
First we start with the log posterior function.
```r
log_posterior <- function(X, y, th) {
  # Args
  # X: the model matrix
  # y: the target vector
  # th: theta, the current parameter estimates

  beta   = th[-length(th)]   # reg coefs to be estimated
  sigma  = th[length(th)]    # sigma to be estimated
  sigma2 = sigma^2
  mu     = X %*% beta

  # priors are b0 ~ N(0, sd=10), sigma2 ~ invGamma(.001, .001)
  priorbvarinv = diag(1/100, 4)
  prioralpha   = priorbeta = .001

  if (is.nan(sigma) | sigma <= 0) {   # scale parameter must be positive, so post
    return(-Inf)                      # density is zero if it jumps below zero
  }
  # log posterior in this conjugate setting. conceptually it's (log) prior +
  # (log) likelihood. (See commented 'else' for alternative)
  else {
    -.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y - mu))) +
      -.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta) %*% priorbvarinv %*% beta)) +
      -(prioralpha + 1)*log(sigma2) + log(sigma2) - priorbeta/sigma2
  }
  # else {
  #   ll = mvtnorm::dmvnorm(y, mean=mu, sigma=diag(sigma2, length(y)), log=T)
  #   priorb = mvtnorm::dmvnorm(beta, mean=rep(0, length(beta)), sigma=diag(100, length(beta)), log=T)
  #   priors2 = dgamma(1/sigma2, prioralpha, priorbeta, log=T)
  #   logposterior = ll + priorb + priors2
  #   logposterior
  # }
}
```
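As a quick sanity check (my addition, not part of the original demo), we can evaluate the log posterior at the least squares estimates and at an arbitrary poor guess; the former should be noticeably higher.

```r
# sanity check (not in the original demo): the log posterior at the least
# squares estimates should be much higher than at a poor guess
log_posterior(X, y, c(coef(fit_lm), sigma(fit_lm)))
log_posterior(X, y, c(0, 0, 0, 0, 1))
```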
The following is the numerical gradient function as given in BDA3 p. 602. It has the same arguments as the log posterior function.
```r
gradient_theta <- function(X, y, th) {
  d     = length(th)
  e     = .0001
  diffs = numeric(d)

  for (k in 1:d) {
    th_hi = th
    th_lo = th
    th_hi[k] = th[k] + e
    th_lo[k] = th[k] - e
    diffs[k] = (log_posterior(X, y, th_hi) - log_posterior(X, y, th_lo)) / (2 * e)
  }

  diffs
}
```
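If you want to convince yourself the numerical gradient behaves, one option is to compare it against numDeriv::grad at a test value. This check is my addition and assumes the numDeriv package is installed; it is not part of the original demo.

```r
# optional check, assuming numDeriv is installed (not used in the original demo)
th_test = c(coef(fit_lm), sigma = sigma(fit_lm))

cbind(
  by_hand  = gradient_theta(X, y, th_test),
  numDeriv = numDeriv::grad(function(th) log_posterior(X, y, th), th_test)
)
```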
The following is a function for a single HMC iteration. \(\epsilon\) and \(L\) are drawn randomly at each iteration to explore other areas of the posterior (starting with \(\epsilon_0\) and \(L_0\)). The mass matrix \(M\), expressed as a vector, is a bit of a magic number in this setting. It regards the mass of a particle whose position is represented by \(\theta\) and momentum by \(\phi\). See the sampling section of the Stan manual for more detail.
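To make the connection to the code explicit: with momentum \(\phi \sim \mathcal{N}(0, M)\), the quantity tracked in the function below (log_p_old and log_p_star) is the log of the joint density of position and momentum, i.e. the negative Hamiltonian,

\[
\log p(\theta, \phi \mid y) = \log p(\theta \mid y) - \tfrac{1}{2}\, \phi^\top M^{-1} \phi + \text{constant},
\]

where the constant drops out of the acceptance ratio.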
```r
hmc_iteration <- function(X, y, th, epsilon, L, M) {
  # Args
  # epsilon: the stepsize
  # L: the number of leapfrog steps
  # M: a diagonal mass matrix

  # initialization
  M_inv  = 1/M
  d      = length(th)
  phi    = rnorm(d, 0, sqrt(M))
  th_old = th

  log_p_old = log_posterior(X, y, th) - .5*sum(M_inv * phi^2)

  phi = phi + .5 * epsilon * gradient_theta(X, y, th)

  for (l in 1:L) {
    th  = th + epsilon * M_inv * phi
    phi = phi + ifelse(l == L, .5, 1) * epsilon * gradient_theta(X, y, th)
  }

  # here we get into standard MCMC stuff, jump or not based on a draw from a
  # proposal distribution
  phi        = -phi
  log_p_star = log_posterior(X, y, th) - .5*sum(M_inv * phi^2)
  r          = exp(log_p_star - log_p_old)

  if (is.nan(r)) r = 0

  p_jump = min(r, 1)

  if (runif(1) < p_jump) {
    th_new = th
  }
  else {
    th_new = th_old
  }

  # returns estimates and acceptance rate
  list(th = th_new, p_jump = p_jump)
}
```
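Before running full chains, you can try a single iteration on its own; the starting point and tuning values below are purely illustrative (my addition), not settings from the original demo.

```r
# one HMC iteration from an arbitrary starting point (illustrative values only)
one_step = hmc_iteration(
  X, y,
  th      = c(coef(fit_lm), sigma = 1),
  epsilon = .08,
  L       = 10,
  M       = rep(1, 5)
)

one_step$p_jump   # acceptance probability for this single proposal
one_step$th       # accepted (or retained) parameter values
```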
Main HMC function.
```r
hmc_run <- function(starts, iter, warmup, epsilon_0, L_0, M, X, y) {
  # Args:
  # starts: starting values
  # iter: total number of simulations for each chain (note chain is based on the dimension of starts)
  # warmup: determines which of the initial iterations will be ignored for inference purposes
  # epsilon_0: the baseline stepsize
  # L_0: the baseline number of leapfrog steps
  # M: the mass vector

  chains = nrow(starts)
  d      = ncol(starts)
  sims   = array(NA,
                 c(iter, chains, d),
                 dimnames = list(NULL, NULL, colnames(starts)))
  p_jump = matrix(NA, iter, chains)

  for (j in 1:chains) {
    th = starts[j, ]

    for (t in 1:iter) {
      epsilon = runif(1, 0, 2*epsilon_0)
      L       = ceiling(2*L_0*runif(1))

      temp = hmc_iteration(X, y, th, epsilon, L, M)

      p_jump[t, j] = temp$p_jump
      sims[t, j, ] = temp$th

      th = temp$th
    }
  }

  # acceptance rate
  acc = round(colMeans(p_jump[(warmup + 1):iter, ]), 3)

  message('Avg acceptance probability for each chain: ',
          paste0(acc[1], ', ', acc[2]), '\n')

  list(sims = sims, p_jump = p_jump)
}
```
Estimation
With the primary functions in place, we set the starting values and choose other settings for the HMC process. The coefficient starting values are based on random draws from a uniform distribution, while \(\sigma\) is set to a value of one in each case. Here we’ll have 1000 total draws, with warm-up set to 500. I don’t have any thinning option here, but that could be added, or simply done as part of the coda package preparation.
```r
# Starting values and mcmc settings
parnames = c(paste0('beta[', 1:4, ']'), 'sigma')
d = length(parnames)

chains = 2

theta_start = t(replicate(chains, c(runif(d - 1, -1, 1), 1)))
colnames(theta_start) = parnames

nsim = 1000
wu   = 500
```
We can fiddle with these sampling parameters to get a desirable acceptance rate of around .80. The following work well with the data we have.
```r
stepsize = .08
nLeap    = 10
vars     = rep(1, 5)
mass_vector = 1 / vars
```
We are now ready to run the model. On my machine and with the above settings, it took a couple of seconds. Once complete, we can use the coda package if desired, as we have done before.
```r
# Run the model
fit_hmc = hmc_run(
  starts    = theta_start,
  iter      = nsim,
  warmup    = wu,
  epsilon_0 = stepsize,
  L_0       = nLeap,
  M         = mass_vector,
  X         = X,
  y         = y
)

# str(fit_hmc, 1)
```
Using coda, we can get some nice summary information. Results not shown.
```r
library(coda)

theta = as.mcmc.list(list(as.mcmc(fit_hmc$sims[(wu + 1):nsim, 1, ]),
                          as.mcmc(fit_hmc$sims[(wu + 1):nsim, 2, ])))

# summary(theta)

fit_summary = summary(theta)$statistics[, 'Mean']

beta_est  = fit_summary[1:4]
sigma_est = fit_summary[5]

# log_posterior(X, y, fit_summary)
```
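As noted earlier, thinning wasn’t built into hmc_run, but it (and other diagnostics) can be handled at this stage with standard coda functions; a possible sketch, with an arbitrary thinning interval:

```r
# optional: thin the retained draws and run standard diagnostics
theta_thin = window(theta, thin = 10)

effectiveSize(theta)   # effective sample size per parameter
gelman.diag(theta)     # potential scale reduction factors (Rhat)
# traceplot(theta)     # visual check of mixing
```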
Instead, we can use rstan’s monitor function on fit_hmc$sims to produce typical Stan output.
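A call along these lines (assuming rstan is installed) produces the kind of summary shown below; monitor expects an iterations × chains × parameters array, which is exactly the shape of fit_hmc$sims.

```r
# assumes rstan is installed
rstan::monitor(fit_hmc$sims, warmup = wu, digits_summary = 3)
```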
parameter | mean | sd | 2.5% | 97.5% | n_eff | Rhat | Bulk_ESS | Tail_ESS |
---|---|---|---|---|---|---|---|---|
beta[1] | 4.898 | 0.128 | 4.645 | 5.139 | 505 | 1.006 | 521 | 305 |
beta[2] | 0.077 | 0.133 | -0.192 | 0.334 | 560 | 1.012 | 566 | 240 |
beta[3] | -1.472 | 0.122 | -1.710 | -1.224 | 628 | 1.010 | 630 | 457 |
beta[4] | 0.825 | 0.111 | 0.612 | 1.042 | 618 | 1.000 | 630 | 569 |
sigma | 2.013 | 0.090 | 1.844 | 2.204 | 682 | 1.004 | 693 | 425 |
Comparison
Our estimates look pretty good, and inspection of the diagnostics would show good mixing and convergence as well. At this point we can compare them to the Stan output. For the following, I use the same inverse gamma prior and tweak the control options for a bit more similarity, but that’s not necessary.
```stan
data {                      // Data block
  int<lower = 1> N;         // Sample size
  int<lower = 1> K;         // Dimension of model matrix
  matrix[N, K] X;           // Model matrix
  vector[N] y;              // Target variable
}

parameters {                // Parameters block; declarations only
  vector[K] beta;           // Coefficient vector
  real<lower = 0> sigma;    // Error scale
}

model {                     // Model block
  vector[N] mu;
  mu = X * beta;            // Creation of linear predictor

  // priors
  beta  ~ normal(0, 10);
  sigma ~ inv_gamma(.001, .001);   // inverse gamma a la the code above

  // likelihood
  y ~ normal(mu, sigma);
}
```
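The original demo doesn’t show the fitting call itself, but a sketch along the following lines would work. Here stan_code is an assumed name for a string holding the model above, the data list simply matches the data block, and the iteration and control settings are my guess at the kind of tweak mentioned, not values taken from the source.

```r
# sketch of fitting the Stan model above with rstan (assumed names/settings)
library(rstan)

stan_data = list(N = nrow(X), K = ncol(X), X = X, y = as.numeric(y))

fit_stan = stan(
  model_code = stan_code,   # the Stan program above as a character string (assumed)
  data       = stan_data,
  iter       = 5000,
  warmup     = 2500,
  chains     = 2,
  control    = list(adapt_delta = .99, max_treedepth = 12)
)

# print(fit_stan, pars = c('beta', 'sigma'))
```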
Here are the results.
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
beta[1] | 1.421 | 3.336 | -1.761 | 5.095 |
beta[2] | 1.075 | 0.818 | -0.121 | 1.710 |
beta[3] | 0.220 | 1.474 | -1.667 | 1.480 |
beta[4] | -0.094 | 0.786 | -0.738 | 1.013 |
sigma | 2.034 | 0.064 | 1.889 | 2.199 |
And finally, the standard least squares fit (Residual standard error = sigma).
term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 4.898 | 0.1284 | 38.13 | 3.026e-105 |
X1 | 0.08408 | 0.1296 | 0.6488 | 0.5171 |
X2 | -1.469 | 0.1261 | -11.64 | 3.049e-25 |
X3 | 0.8196 | 0.1207 | 6.793 | 8.207e-11 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
250 | 2.021 | 0.4524 | 0.4458 |
Source
Original demo here:
https://m-clark.github.io/bayesian-basics/appendix.html#hamiltonian-monte-carlo-example