Preface

The following is a list of Markov Chain Monte Carlo algorithms that was formerly found on the now-defunct website bayesian-inference.com. That site was also home to the once useful LaplacesDemon R package, which was evidently abandoned, but has recently been revived here, and is even available on CRAN again (link). The ambitious goal of LaplacesDemon was to provide a complete Bayesian inference environment within R. When first getting into Bayesian analysis in R, I found its vignettes clear and the website a nice resource. The vignettes are available on CRAN and below, and reflect current development.

One of the useful things on the LD website that I always meant to get back to and spend more time with was the list of algorithms that the package could utilize. As others might also be interested, I have reproduced that here, culled from the depths of the Wayback Machine. While some things might be specific to the R package, the considerations and comparisons, as well as the relative strengths, weaknesses, other descriptions, and references, are informative in my opinion. The references date to around 2014, and while they may no longer be current, they appear to have been very up to date at that point. I also appreciate the historical context provided here and there, and while I think the descriptions will be fine for getting a general sense of the algorithms for some time, they obviously will not reflect any recent developments.

In reproducing this, I have tried to keep an eye out for things like subscripts and other notation, but didn’t spend a lot of time on that. I also cleaned up typos and other issues I came across. I have no real intention of doing much more with this; I just provide it out of my own interest. All credit beyond this preface goes to Byron Hall and Statisticat LLC.




Markov Chain Monte Carlo

(MCMC)


Markov chain Monte Carlo (MCMC) algorithms, also called samplers, are numerical approximation algorithms. There are a large number of MCMC algorithms, too many to review here. Popular families (which are often non-distinct) include Gibbs sampling, Metropolis-Hastings, slice sampling, Hamiltonian Monte Carlo, and many others. Though the name is misleading, Metropolis-within-Gibbs (MWG) was developed first (Metropolis, Rosenbluth, and Teller 1953), and Metropolis-Hastings was a generalization of MWG (Hastings 1970). All MCMC algorithms are known as special cases of the Metropolis-Hastings algorithm. Regardless of the algorithm, the goal in Bayesian inference is to maximize the unnormalized joint posterior distribution and collect samples of the target distributions, which are marginal posterior distributions, later to be used for inference.

The most generalizable MCMC algorithm is the Metropolis-Hastings (MH) generalization of the MWG algorithm. The MH algorithm extended MWG to include asymmetric proposal distributions. For years, the main disadvantage of the MWG and MH algorithms was that the proposal variance (see below) had to be tuned manually, and therefore other MCMC algorithms have become popular because they do not need to be tuned.

Gibbs sampling became popular for Bayesian inference, though it requires conditional sampling of conjugate distributions, so it is precluded from non-conjugate sampling in its purest form. Gibbs sampling also suffers under high correlations (Gilks and Roberts 1996). Due to these limitations, Gibbs sampling is less generalizable than random-walk Metropolis (RWM), though RWM and other algorithms are not immune to problems with correlation. The original slice sampling algorithm of Neal (1997) is a special case of Gibbs sampling that samples a distribution by sampling uniformly from the region under the plot of its density function, and is more appropriate with bounded distributions that cannot approach infinity, though the improved Slice sampler of Neal (2003) is available here.

Blockwise Sampling

Usually, there is more than one target distribution, in which case it must be determined whether it is best to sample from target distributions individually, in groups, or all at once. Block updating refers to splitting a multivariate vector into groups called blocks, and each block is sampled separately. A block may contain one or more parameters.

Parameters are usually grouped into blocks such that parameters within a block are as correlated as possible, and parameters between blocks are as independent as possible. This strategy retains as much of the parameter correlation as possible for blockwise sampling, as opposed to componentwise sampling where parameter correlation is often ignored. The PosteriorChecks function can be used on the output of previous runs to find highly correlated parameters, and the Blocks function may be used to create blocks based on posterior correlation.

Advantages of blockwise sampling are that a different MCMC algorithm may be used for each block (or each parameter, for that matter), creating a more specialized approach (though different algorithms by block are not supported here); that the acceptance rate of a newly proposed state is likely to be higher than when sampling from all target distributions at once in high dimensions; and that large proposal covariance matrices can be reduced in size, which again is most helpful in high dimensions.

Disadvantages of blockwise sampling are that correlations probably exist between parameters in different blocks, and each block is updated while holding the other blocks constant, ignoring these between-block correlations. Without simultaneously taking everything into account, the algorithm may converge slowly or never arrive at the proper solution. However, there are instances when it may be best not to take everything into account at once, such as in state-space models. Also, as the number of blocks increases, more computation is required, which slows the algorithm. In general, blockwise sampling allows a more specialized approach at the expense of accuracy, generalization, and speed.

Blockwise sampling is offered in the following algorithms: Adaptive Metropolis-within-Gibbs (AMWG), Adaptive-Mixture Metropolis (AMM), Automated Factor Slice Sampler (AFSS), Elliptical Slice Sampler (ESS), Hit-And-Run Metropolis (HARM), Metropolis-within-Gibbs (MWG), Random-Walk Metropolis (RWM), Robust Adaptive Metropolis (RAM), and the Univariate Eigenvector Slice Sampler (UESS).
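
To make the block specification concrete, here is a hedged sketch of a blockwise AMWG call in LaplacesDemon, using the B, n, and Periodicity specifications described in the AMWG section below (Model, MyData, and Initial.Values are assumed to be defined as usual for the package; the block assignments are illustrative):

```r
# Parameters 1-3 form one block and 4-5 another; B is a list in which
# each component is a vector of parameter positions.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=20000, Status=1000, Thinning=10,
     Algorithm="AMWG", Specs=list(B=list(1:3, 4:5), n=0, Periodicity=50))
```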

Markov Chain Properties

Only the basics of Markov chain properties are introduced here. A Markov chain is Markovian when the current iteration depends only on the previous iteration. Many (but not all) adaptive algorithms are merely chains but not Markov chains when the adaptation is based on the history of the chains, not just the previous iteration. A Markov chain is said to be aperiodic when it is not repeating a cycle. A Markov chain is considered irreducible when it is possible to go from any state to any other state, though not necessarily in one iteration. A Markov chain is said to be recurrent if it will eventually return to a given state with probability 1, and it is positive recurrent if the expected return time is finite, and null recurrent otherwise. The ergodic theorem states that a Markov chain is ergodic when it is aperiodic, irreducible, and positive recurrent.

The non-Markovian chains of an adaptive algorithm that adapts based on the history of the chains should satisfy two conditions: containment and diminishing adaptation. Containment is a technical condition that is difficult to verify. The condition of diminishing adaptation is fulfilled when the amount of adaptation diminishes with the length of the chain. Diminishing adaptation can be achieved when the proposal variances become smaller, or by decreasing the probability of performing adaptations as iterations increase (Roberts and Rosenthal 2007).
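
To make diminishing adaptation concrete, here is a minimal, self-contained R toy (not LaplacesDemon internals) in which a random-walk sampler on a standard normal target adapts its proposal standard deviation with probability 1/sqrt(iteration), so that adaptation diminishes as the chain lengthens:

```r
# Random-walk Metropolis on a N(0,1) target with diminishing adaptation:
# the proposal sd is nudged toward a 44% acceptance rate, but the
# probability of adapting shrinks with the iteration number.
set.seed(1)
iters <- 5000; x <- 0; s <- 1; accepts <- 0; draws <- numeric(iters)
for (iter in 1:iters) {
  prop <- rnorm(1, x, s)
  if (log(runif(1)) < dnorm(prop, log=TRUE) - dnorm(x, log=TRUE)) {
    x <- prop; accepts <- accepts + 1
  }
  if (runif(1) < 1/sqrt(iter))  # diminishing probability of adaptation
    s <- s * ifelse(accepts/iter > 0.44, 1.1, 0.9)
  draws[iter] <- x
}
accepts/iters  # observed acceptance rate, near 0.44
```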

Algorithm Summary Table

Each algorithm presented below has an algorithm summary table that is meant to provide a quick overview and facilitate comparisons between algorithms. Below is a description of the contents of these tables.

  • Acceptance Rate: The acceptance rate is a consideration of the number of accepted and rejected proposals. It is usually the ratio of the number of accepted proposals to the number of iterations. When known, the optimal acceptance rate and a recommended interval are reported.
  • Applications: Since no algorithm is suitable for every problem, this is meant to provide guidance on the applicability of the algorithm.
  • Difficulty: The difficulty of using the algorithm is considered for a beginner. Considerations include the number of algorithm specifications, tuning the algorithm, whether or not it may be used as a final algorithm, and general experiences with use.
  • Final Algorithm?: This description is usually ‘Yes’, ‘No’, or ‘User Discretion’, followed by an explanation. When a ‘final algorithm’ seems to have converged, the samples may be used for inference. When it is not a final algorithm, then one is suggested.
  • Proposal: This describes how the proposal is generated, and is usually ‘Multivariate’ or ‘Componentwise’, though other descriptions exist as well. In general, a multivariate proposal means that, each iteration, a proposal is generated for all parameters in a way that takes correlation into account, usually from a multivariate normal distribution and a proposal covariance matrix. Componentwise proposals usually indicate that a proposal is made for each parameter, without considering correlation. A componentwise algorithm must evaluate the model a number of times equal to the number of parameters, per iteration. Componentwise algorithms are considerably slower per iteration, and speed per iteration decreases as the number of parameters increases.


Stopping Rules

After an MCMC algorithm is selected, a model is specified, and the priors are updated given the likelihood over numerous iterations. A question with MCMC is: when to stop iterating/updating? Many stopping rules have been proposed. The most popular single-chain stopping rule involves Monte Carlo Standard Error (MCSE). An alternative, when this is too difficult, may be to stop when the effective sample size (ESS) reaches a target. The most popular parallel-chain stopping rule uses the Gelman-Rubin Diagnostic.
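
For example, assuming a completed fit object Fit from LaplacesDemon, the single-chain quantities above might be checked as follows (MCSE and ESS are LaplacesDemon functions; the parallel-chain Gelman-Rubin Diagnostic is available as Gelman.Diagnostic):

```r
# Monte Carlo standard error and effective sample size, per parameter,
# computed on the thinned posterior samples of a hypothetical fit object.
apply(Fit$Posterior1, 2, MCSE)
apply(Fit$Posterior1, 2, ESS)
```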

MCMC Diagnostics

After the user decides to stop iterating MCMC and updating the model, MCMC diagnostics are performed to assess non-convergence. Markov (and non-Markovian) chains cannot be confirmed to have converged, so diagnostics are used to look for signs of non-convergence. Numerous MCMC diagnostics have been proposed. It is customary to observe trace plots and autocorrelation plots of the chains, assess stationarity with the BMK Diagnostic, and search for signs of non-convergence with other diagnostics. The LaplacesDemon software package offers a variety of MCMC diagnostics.

References

  • Gilks W, Roberts G (1996). “Strategies for Improving MCMC.” In W Gilks, S Richardson, D Spiegelhalter (eds.), Markov Chain Monte Carlo in Practice, p. 89-114. Chapman & Hall, Boca Raton, FL.
  • Hastings W (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications.” Biometrika, 57(1), 97-109.
  • Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E (1953). “Equation of State Calculations by Fast Computing Machines.” Journal of Chemical Physics, 21, 1087-1092.
  • Neal R (1997). “Markov Chain Monte Carlo Methods Based on Slicing the Density Function.” Technical Report, University of Toronto.
  • Neal R (2003). “Slice Sampling (with Discussion).” Annals of Statistics, 31(3), 705-767.
  • Roberts G, Rosenthal J (2007). “Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms.” Journal of Applied Probability, 44, 458-475.

Adaptive Directional Metropolis-within-Gibbs

(ADMG)


  • Acceptance Rate: The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm.
  • Difficulty: This algorithm is relatively easy for a beginner. It has only two algorithm specifications. However, since it is adaptive, the user must regard diminishing adaptation.
  • Final Algorithm?: User Discretion. The MWG algorithm is recommended as the final algorithm, though ADMG may be used if diminishing adaptation occurs and adaptation ceases effectively.
  • Proposal: Componentwise.


The Adaptive Directional Metropolis-within-Gibbs (ADMG) algorithm was proposed by Bai (2009) as an extension of the Adaptive Metropolis-within-Gibbs (AMWG) algorithm in which the direction and jumping distance of the componentwise proposal is affected by an estimate of the historical covariance matrix.

ADMG has two algorithm specifications: n is the number of previous iterations and Periodicity is the frequency in iterations of adaptation. The minimum Periodicity is equal to the number of parameters, and should probably be greater.
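
A hedged example call, with the usual Model, MyData, and Initial.Values assumed to be defined:

```r
# ADMG with no previous adaptive iterations, adapting every 50 iterations
# (Periodicity should be at least the number of parameters).
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="ADMG", Specs=list(n=0, Periodicity=50))
```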

Although ADMG is a componentwise algorithm, it does not ignore correlation of parameters, as is typical of componentwise algorithms. ADMG makes each componentwise proposal based on an estimate of the historical covariance matrix. Adaptation is based on the singular value decomposition (SVD) of the estimate of the historical covariance matrix. In large dimensions, SVD is computationally expensive, and larger integers are recommended for the algorithm specifications. If SVD results in a singularity, then the algorithm defaults to a simple Metropolis-within-Gibbs (MWG) proposal. Otherwise, when SVD is successful, MWG is performed under the rotated coordinates.

The author notes that the MWG part of the algorithm may proceed by deterministic-scan (which he refers to as systematic-scan) or random-scan. The algorithm implemented here uses random-scan, and is denoted by the author as ADRSMG. With random-scan updating, the order in which the parameters are updated is randomized each iteration.

Compared to other adaptive algorithms with multivariate proposals, a disadvantage is the time to complete each iteration increases as a function of parameters and model complexity, as noted in MWG. For example, in a 100-parameter model, ADMG completes its first iteration as the AMM algorithm completes its 100th. However, ADMG is more efficient with information per iteration.

The advantage of ADMG to other componentwise algorithms (such as AMWG, MWG, RDMH, and Slice) is that ADMG makes each proposal while accounting for parameter correlation, while other componentwise algorithms traditionally ignore parameter correlation. The author asserts that SVD-based adaptation is more efficient than adapting directly to the historical covariance matrix or an estimate thereof. The disadvantage of ADMG to other componentwise algorithms is that SVD becomes computationally expensive in large dimensions. Since ADMG is adaptive, MWG should be used as the final algorithm.

References

  • Bai Y (2009). “An Adaptive Directional Metropolis-within-Gibbs Algorithm.” Technical Report, Department of Statistics, University of Toronto.


Automated Factor Slice Sampler

(AFSS)


  • Acceptance Rate: The acceptance rate is 1.
  • Applications: This is a widely applicable, general-purpose algorithm.
  • Difficulty: This algorithm is easy for a beginner. There are several algorithm specifications, and the defaults are recommended for automatic use.
  • Final Algorithm?: Yes, when A=0. See below.
  • Proposal: Multivariate proposals with componentwise evaluation.


The Automated Factor Slice Sampler of Tibbits et al. (2014) is an extension of the componentwise slice sampler in which linear correlation between parameters is accounted for and the interval widths are tuned. This results in a componentwise slice sampler that is faster than a traditional componentwise slice sampler, because fewer evaluations are necessary to estimate the slice boundaries, and it samples efficiently in high-dimensional models with continuous parameters.

AFSS has five algorithm specifications: A accepts a scalar that indicates the iteration at which adaptation will cease, and defaults to Inf. B is for blockwise sampling, defaults to NULL, and otherwise accepts a list of parameter positions, in which each list component is a block. Blocks are processed sequentially, but the order of parameters is selected randomly within each block, and each block must contain at least two parameters. The m specification accepts either a scalar integer or a vector equal in length to the number of parameters that indicates the maximum number of steps; it defaults to 100 and must be in [1, ∞). The n specification accepts a scalar that indicates the total number of previous adaptive iterations, and defaults to 0. Finally, w accepts either a scalar or a vector equal in length to the number of parameters, and is the initial step size, which defaults to 1 and must be in (0, ∞).
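
A hedged sketch of an AFSS call with these defaults (to use AFSS as a final algorithm, set A to a finite iteration so non-adaptive samples are drawn):

```r
# AFSS with the defaults described above: adapt indefinitely (A=Inf),
# no blocks, at most 100 interval steps, no previous adaptive iterations,
# and initial step size 1.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="AFSS", Specs=list(A=Inf, B=NULL, m=100, n=0, w=1))
```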

AFSS performs two kinds of adaptation. While adaptive, each iteration AFSS collects an approximation of the covariance of the parameters via a scatter matrix. Periodically, eigenvectors are estimated from the approximated covariance matrix. More parameters results in less frequent adaptation. Each iteration, componentwise evaluation is performed as per a componentwise sampler. However, after the first eigenvalue estimation, all parameters are updated for each componentwise evaluation, which results in a sampler that is not truly multivariate or componentwise. Each parameter is updated in turn, but all other parameters are also updated due to linear correlations with the current parameter. Also, after each estimation of eigenvalues, the interval widths are tuned toward an equal probability of interval expansion during the step-out phase and interval shrinkage during rejection sampling. As the number of parameters increases, so does the computational cost of estimating eigenvalues, and blockwise sampling becomes more practical in high-dimensional models.

While AFSS is efficient with linear relationships, it may be inefficient due to nonlinear relationships. The AFSS algorithm is slower per iteration than multivariate samplers, but typically improves more per iteration than most algorithms. To date, testing indicates AFSS is remarkably efficient. AFSS is acceptable as a final algorithm when non-adaptive samples are drawn.

References

  • Tibbits M, Groendyke C, Haran M, and Liechty J (2014). “Automated Factor Slice Sampling”. Journal of Computational and Graphical Statistics, 23(2), 543-563.


Adaptive Griddy-Gibbs

(AGG)


  • Acceptance Rate: The acceptance rate is 1.
  • Applications: This is a widely applicable, general-purpose algorithm.
  • Difficulty: This algorithm is easy to use because the scale of the grid is adaptively tuned. There is only one required algorithm specification for defining the grid. Additional optional algorithm specifications control parallelization, allow discrete parameters, or prevent extreme values.
  • Final Algorithm?: Yes.
  • Proposal: Componentwise.


The Adaptive Griddy-Gibbs (AGG) sampler is an extension of the Griddy-Gibbs (GG) sampler in which the scale of the centered grid is tuned each iteration for continuous parameters. The scale is calculated as the standard deviation of the conditional distribution. If discrete parameters are used, then the grid of discrete parameters is not tuned, since all discrete values are evaluated.

AGG has six algorithm specifications: Grid is a vector or list of points in the grid, dparm is a vector that indicates discrete parameters and defaults to NULL, smax is the maximum allowable conditional standard deviation, CPUs is the number of available central processing units (CPUs), Packages is a vector of package names, and Dyn.lib is a vector of shared libraries. The default for Grid is GaussHermiteQuadRule(3)$nodes. For each continuous parameter in each iteration, the scaled grid values are added to the latest value of the parameter, and the model is evaluated with the parameter at each point on the grid and a sample is taken. For each discrete parameter, the model is evaluated at each grid point and a sample is taken.
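
An illustrative call with a three-node Gauss-Hermite grid (the smax value here is an assumption; set it lower if occasional extreme values of LP are observed, as discussed below):

```r
# AGG with a small, odd number of grid nodes and no discrete parameters,
# run without parallelization.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="AGG", Specs=list(Grid=GaussHermiteQuadRule(3)$nodes,
          dparm=NULL, smax=Inf, CPUs=1, Packages=NULL, Dyn.lib=NULL))
```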

The points in the grid for continuous parameters are selected by the user as the nodes of a Gauss-Hermite quadrature rule (with the GaussHermiteQuadRule function), and these points are not evenly-spaced. Many problems require as few as 3 nodes, while others require perhaps as many as 99 nodes. When the number of nodes is larger than necessary, time is wasted in computation, fewer points occur within most of the density, and occasional extreme values of LP are observed, but the chains reach the target distributions in fewer iterations due to the wider grid. When the number of nodes is too small, it may not converge on the correct distribution. It is therefore suggested to begin with a small, odd number of grid points, such as 3, to reduce computation time. An odd number of grid points is preferred. If occasional extreme values of LP are observed, set smax to something reasonable. If smax is too small, then higher autocorrelation will occur and more thinning is necessary.

Advantages of parallelized AGG over most componentwise algorithms are that it yields a 100% acceptance rate, and draws from the approximated distribution should be less autocorrelated. A disadvantage is that more model evaluations are required, and even if a parallel environment had zero overhead, AGG would be twice as slow per iteration as other componentwise algorithms. AGG is adaptive in the sense of self-adjusting, but not in the sense of being non-Markovian, so it is suitable as a final algorithm.

References

  • none listed


Adaptive Hamiltonian Monte Carlo

(AHMC)


  • Acceptance Rate: The optimal acceptance rate is 65% when L > 1, or 57.4% when L = 1. The observed acceptance rate may be suitable in the interval [60%,70%] when L > 1, or [40%,80%] when L = 1.
  • Applications: This is a widely applicable, general-purpose algorithm that is best suited to models with a small number of parameters. The number of model evaluations per iteration increases with the number of parameters.
  • Difficulty: This algorithm is relatively difficult for a beginner. It has four algorithm specifications. Even though it is adaptive, it is difficult to tune. Since it is an adaptive algorithm, the user must regard diminishing adaptation.
  • Final Algorithm?: User Discretion. The HMC algorithm is recommended as the final algorithm, though AHMC may be used if diminishing adaptation occurs and adaptation ceases effectively.
  • Proposal: Multivariate. Proposals are multivariate only in the sense that proposals for multiple parameters are generated at once. However, proposals are not generated with a multivariate distribution and a proposal covariance matrix. Each iteration involves numerous proposals, due to partial derivatives and L.


This is an adaptive form of Hamiltonian Monte Carlo (HMC) called Adaptive Hamiltonian Monte Carlo (AHMC). In AHMC, the ε parameter is adapted, though L and M are not. When adapting, and considering K parameters, AHMC multiplies εk by 0.8 when a proposal for parameter k has not been accepted in the last 10 iterations, or multiplies it by 1.2 when a proposal has been accepted in at least 8 of the last 10 iterations, as suggested by Neal (2011).

AHMC has four algorithm specifications: epsilon or ε is the step-size, L is the number of leapfrog steps, M is an optional mass matrix, and Periodicity is the frequency in iterations for adaptation of ε.
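
A hedged sketch of an AHMC call (all values are illustrative; ε is given per parameter here, since it is adapted componentwise):

```r
# AHMC with a per-parameter step-size vector, two leapfrog steps, no mass
# matrix, and adaptation of epsilon every 10 iterations.
J <- length(Initial.Values)
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="AHMC",
     Specs=list(epsilon=rep(0.02, J), L=2, M=NULL, Periodicity=10))
```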

Although AHMC generates multivariate proposals, parameter correlation is not taken into account unless a mass matrix is used. Some sources refer to the mass matrix in HMC as a covariance matrix, and some as a precision matrix. Here, the mass matrix is a covariance matrix. The mass matrix is difficult to tune for several reasons. The optimal mass matrix is different with different configurations of the parameters, so what was optimal at one point is sub-optimal at another. It is not recommended to substitute the historical covariance matrix of samples while pursuing equilibrium.

As with HMC, the AHMC algorithm is slower per iteration than many other algorithms, but often produces chains with good mixing when well-tuned. An alternative to AHMC that should perform better is HMCDA. AHMC is more consistent with respect to time per iteration, because L remains constant, than HMCDA and NUTS, which may have some iterations that are much slower than others. If AHMC is used for adaptation, then the final, non-adaptive algorithm should be HMC.

References

  • Neal R (2011). “MCMC Using Hamiltonian Dynamics.” In S Brooks, A Gelman, G Jones, XL Meng (eds.), Handbook of Markov Chain Monte Carlo, p. 113-162. Chapman & Hall, Boca Raton, FL.


Affine-Invariant Ensemble Sampler

(AIES)


  • Acceptance Rate: The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm.
  • Difficulty: This algorithm is relatively easy for a beginner. The user must select the number of walkers, though the recommended default is suitable, must supply a vector of initial values for each walker, and specify a scale parameter, which again has a suggested default. If the user experiences difficulty, the recommendation is to increase the number of walkers.
  • Final Algorithm?: Yes.
  • Proposal: Multivariate by Walker.


The Affine-Invariant Ensemble Sampler (AIES) of Goodman and Weare (2010) uses a complementary ensemble of at least 2J walkers for J parameters. Each walker receives J initial values, and initial values must differ for each walker. At each iteration, AIES makes a multivariate proposal for each walker, given a scaled difference in position by parameter between the current walker and another randomly-selected walker.

AIES has six algorithm specifications: Nc is the number of walkers, Z is an \(N_c \times J\) matrix of initial values, β is a scale parameter, CPUs is the number of central processing units (CPUs), Packages is a vector of package names, and Dyn.lib is a vector of shared libraries. The recommended number of walkers is at least \(2J\). If separate sets of initial values are not supplied in Z, since Z defaults to NULL, then the GIV function is used to generate initial values. The original article referred to the scale parameter as α, though it has been renamed here to β to avoid a conflict with the acceptance probability α in the Metropolis step. The β parameter may be manipulated to affect the desired acceptance rate, though in practice the acceptance rate may be better affected by increasing the number of walkers. It is recommended to specify CPUs=1 and leave the remaining arguments as NULL, unless needed.
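
A hedged example with the recommended 2J walkers and internally generated initial values (the beta value is illustrative):

```r
# AIES with 2J walkers, Z=NULL so GIV generates initial values, and no
# parallelization.
J <- length(Initial.Values)
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="AIES", Specs=list(Nc=2*J, Z=NULL, beta=2, CPUs=1,
          Packages=NULL, Dyn.lib=NULL))
```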

This version returns the samples of one walker, and the other walkers assist the main walker. A disadvantage of this approach is that not all samples from all walkers are returned. An advantage of this approach is that if a particular walker is an outlier, then it does not affect the main walker, unless of course it is the main walker! Multiple sets of samples are best returned in a list, such as with parallel chains in the LaplacesDemon.hpc function, though this is not applicable in LaplacesDemon.

AIES has been parallelized by Foreman-Mackey et al. (2012), and this style of parallelization is available here as well. The user is cautioned to prefer CPUs=1, because parallelizing may result in a slower algorithm due to communication between the master and slave CPUs each iteration. This communication is costly, and is best overcome with a large number of CPUs available, and when the Model function is slow to evaluate in comparison to network communication time.

Both the parallelized (CPUs > 1) and un-parallelized (CPUs=1) versions should be called from LaplacesDemon, not LaplacesDemon.hpc. When parallelized, the number of walkers must be an even number (odd numbers are not permitted), and the walkers are split into two equal-sized batches. Each iteration, each walker moves in relation to a randomly-selected walker in the other batch. This retains detailed balance.

AIES is attractive for offering affine-invariance, and therefore being generally robust to badly scaled posteriors, such as with highly correlated parameters. It is also attractive for making a multivariate proposal without a proposal covariance matrix. However, since at least 2J walkers are recommended, the number of model evaluations per iteration is at least twice that of most componentwise algorithms, making AIES a slow algorithm per iteration, and computation scales poorly with model dimension. Large-scale computing environments may overcome this limitation with parallelization, but parallelization is probably not very helpful (and may be detrimental) in small-scale computing environments when evaluating the model function is not slow in comparison with network communication time. AIES is not an adaptive algorithm, and is therefore suitable as a final algorithm.

References

  • Foreman-Mackey D, Hogg D, Lang D, Goodman J (2012). “emcee: The MCMC Hammer.” Upcoming in Publications of the Astronomical Society of the Pacific, http://arxiv.org/abs/1202.3665.
  • Goodman J, Weare J (2010). “Ensemble Samplers with Affine Invariance.” Communications in Applied Mathematics and Computational Science, 5(1), 65-80.


Adaptive Metropolis

(AM)


  • Acceptance Rate: The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm that is best suited to models with a small to medium number of parameters. The proposal covariance matrix must be solved, and this matrix grows with the number of parameters.
  • Difficulty: This algorithm is relatively easy for a beginner. It has few algorithm specifications, and these are easy to specify. However, since it is adaptive, the user must regard diminishing adaptation.
  • Final Algorithm?: User Discretion. The RWM algorithm is recommended as the final algorithm, though AM may be used if diminishing adaptation occurs and adaptation ceases effectively.
  • Proposal: Multivariate.


The Adaptive Metropolis (AM) algorithm of Haario et al. (2001) is an extension of Random-Walk Metropolis (RWM) that adapts based on the observed covariance matrix from the history of the chains. AM is historically significant as the first adaptive MCMC algorithm.

AM has two algorithm specifications: Adaptive is the iteration at which adaptation begins, and Periodicity is the frequency in iterations of adaptation. The user should not allow AM to adapt immediately, since AM adapts based on the observed covariance matrix of historical and accepted samples. Since enough samples are needed to obtain a valid covariance matrix before adaptation, a small covariance matrix is often used initially to encourage a high acceptance rate.
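
A hedged example call, delaying adaptation until enough history exists:

```r
# AM: wait 500 iterations before adapting, so the observed covariance
# matrix is estimated from enough accepted samples, then adapt every 10.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Covar=NULL, Iterations=10000, Status=100, Thinning=10,
     Algorithm="AM", Specs=list(Adaptive=500, Periodicity=10))
```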

As recommended by Haario et al. (2001), there are two tricks that may be used to assist the AM algorithm in the beginning. Although the suggested “greedy start” method is not used here, the second suggested trick is used, which is to shrink the proposal as long as the acceptance rate is less than 5%, and there have been at least five acceptances. Haario et al. (2001) suggest loosely that if it “has not moved enough during some number of iterations, the proposal could be shrunk by a constant factor”. For each iteration in which the acceptance rate is less than 5% and the current iteration is prior to adaptation, Laplace’s Demon multiplies the proposal covariance or variance by (1 - 1/Iterations). Over pre-adaptive time, this encourages a smaller proposal covariance or variance to increase the acceptance rate, so that when adaptation begins, the observed covariance or variance of the chains will not be constant; shrinkage then ceases and adaptation takes it from there.

AM is best suited for a model with small or medium dimensions (number of parameters). The Adaptive-Mixture Metropolis (AMM) of Roberts and Rosenthal (2009) and Robust Adaptive Metropolis (RAM) of Vihola (2011) are extensions of the AM algorithm. If AM is used for adaptation, then the final, non-adaptive algorithm should be RWM.

References

  • Haario H, Saksman E, Tamminen J (2001). “An Adaptive Metropolis Algorithm.” Bernoulli, 7(2), 223-242.
  • Roberts G, Rosenthal J (2009). “Examples of Adaptive MCMC.” Journal of Computational and Graphical Statistics, 18(2), 349-367.
  • Vihola M (2011). “Robust Adaptive Metropolis Algorithm with Coerced Acceptance Rate.” Statistics and Computing (forthcoming). Springer, Netherlands.


Adaptive-Mixture Metropolis

(AMM)


  • Acceptance Rate: The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm that is best suited to models with a small to medium number of parameters, or a larger number of blocked parameters. The proposal covariance matrix must be solved, and this matrix grows with the number of parameters.
  • Difficulty: This algorithm is relatively easy for a beginner. The algorithm specifications are easy to specify. However, since it is adaptive, the user must regard diminishing adaptation.
  • Final Algorithm?: User Discretion. The RWM algorithm is recommended as the final algorithm, though AMM may be used if diminishing adaptation occurs and adaptation ceases effectively.
  • Proposal: Blockwise or Multivariate.


The Adaptive-Mixture Metropolis (AMM) algorithm is an extension by Roberts and Rosenthal (2009) of the Adaptive Metropolis (AM) algorithm of Haario et al. (2001). AMM differs from the AM algorithm in two respects. First, AMM updates a scatter matrix based on the cumulative current parameters and the cumulative associated outer-products, and these are used to generate a multivariate normal proposal. This is more efficient with large numbers of parameters adapting over many iterations, especially with frequent adaptations, and results in a much faster algorithm. The second (and main) difference is that the proposal is a mixture. The two mixture components are adaptive multivariate and static/symmetric univariate proposals. The mixture is determined at each iteration with a mixture weight. The mixture weight must be in the interval (0,1], and it defaults to 0.05, as in Roberts and Rosenthal (2009). A higher value of the mixture weight is associated with more static/symmetric univariate proposals, and a lower weight is associated with more adaptive multivariate proposals. The algorithm will be unable to include the multivariate mixture component until it has accumulated some history, and models with more parameters will take longer to be able to use adaptive multivariate proposals.

AMM has five algorithm specifications: Adaptive is the iteration in which adaptation begins, B optionally accepts a list of blocked parameters and defaults to NULL, n is the number of previous iterations, Periodicity is the frequency in iterations of adaptation, and w is the weight of the small-variance mixture component. B allows the user to organize parameters into blocks. B accepts a list, in which each component is a block and accepts a vector that consists of numbers that point to the associated parameters in parm.names. B defaults to NULL, in which case blocking does not occur. When blocking does occur, the proposal covariance matrix may be either NULL or a list in which each component is the covariance matrix for a block. As more blocks are added, the algorithm becomes closer to Adaptive Metropolis-within-Gibbs (AMWG).
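
A hedged example call without blocking (values illustrative):

```r
# AMM with no blocks, adaptation beginning at iteration 500 and recurring
# every 10 iterations, and the default 0.05 weight on the static
# univariate mixture component.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="AMM",
     Specs=list(Adaptive=500, B=NULL, n=0, Periodicity=10, w=0.05))
```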

The advantages of AMM over AMWG (when B=NULL) are that it takes correlation into account as it adapts, and is much faster to update each iteration. The disadvantages are that AMWG does not require a burn-in period before it can begin to adapt, and more information must be learned in the covariance matrix to adapt properly. Disadvantages of AMM compared to Robust Adaptive Metropolis (RAM) are that RAM does not require a burn-in period before it can begin to adapt, RAM is more likely to better handle multimodal or heavy-tailed targets, and RAM also adapts to the shape of the target distributions and coerces the acceptance rate. If AMM is used for adaptation, then the final, non-adaptive algorithm should be Random-Walk Metropolis (RWM).

References

  • Haario H, Saksman E, Tamminen J (2001). “An Adaptive Metropolis Algorithm.” Bernoulli, 7(2), 223-242.
  • Roberts G, Rosenthal J (2009). “Examples of Adaptive MCMC.” Journal of Computational and Graphical Statistics, 18(2), 349-367.


Adaptive Metropolis-within-Gibbs

(AMWG)


  • Acceptance Rate: The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm.
  • Difficulty: This algorithm is relatively easy for a beginner. It has three algorithm specifications, which are easy to use. However, since it is adaptive, the user must regard diminishing adaptation.
  • Final Algorithm?: User Discretion. The MWG algorithm is recommended as the final algorithm, though AMWG may be used if diminishing adaptation occurs and adaptation ceases effectively.
  • Proposal: Componentwise.


The Adaptive Metropolis-within-Gibbs (AMWG) algorithm is presented in (Roberts and Rosenthal 2009; Rosenthal 2007). It is an adaptive version of Metropolis-within-Gibbs (MWG).

AMWG has three algorithm specifications: B is optional and allows blockwise sampling. It defaults to NULL, but may accept a list in which each list component is a block of parameters. If blockwise sampling is used, then blocks are updated sequentially, and the order of parameters within blocks is randomized. The n specification defaults to 0 and is used to keep track of how many previous adaptive iterations were run. The Periodicity specification indicates the frequency in iterations at which the algorithm adapts. If Periodicity is set too low, such as Periodicity=1, the algorithm may adapt too quickly to poor information in the beginning, and become unstable. Periodicity=50 is recommended.

In AMWG, the standard deviation of the proposal of each parameter is manipulated to optimize the associated acceptance rate toward 0.44. This is much simpler than other adaptive methods that adapt based on sample covariance in large dimensions. Large covariance matrices require a large number of elements to adapt, which takes exponentially longer to adapt as the dimension increases. Regardless of dimension, the AMWG optimizes each parameter to a univariate acceptance rate, and a sample covariance matrix does not need to be estimated for adaptation, which consumes time and memory. The order of the parameters for updating is randomized each iteration (random-scan AMWG), as opposed to sequential updating (deterministic-scan AMWG).

Compared to other adaptive algorithms with multivariate proposals, a disadvantage is the time to complete each iteration increases as a function of parameters and model complexity, as noted in MWG. For example, in a 100-parameter model, AMWG completes its first iteration as the Adaptive-Mixture Metropolis (AMM) algorithm completes its 100th. However, to adapt accurately, the AMM algorithm must correctly estimate 5,050 elements of a sample covariance matrix, while AMWG must correctly estimate only 100 proposal standard deviations. Roberts and Rosenthal (2009) have shown an example model with 500 parameters that had a burn-in of around 25,000 iterations.

The advantages of AMWG over AMM are that AMWG does not require a burn-in period before it can begin to adapt, and that AMWG does not need to estimate a covariance matrix to adapt properly. The disadvantages of AMWG compared to AMM are that correlation can be problematic since it is not taken into account with a proposal covariance matrix, and AMWG solves the model function once per parameter per iteration, which can be unacceptably slow with large or complicated models. The advantage of AMWG over Robust Adaptive Metropolis (RAM) is that AMWG does not need to estimate a covariance matrix to adapt properly. The disadvantages of AMWG compared to RAM are that AMWG is less likely to handle multimodal or heavy-tailed targets, and AMWG solves the model function once per parameter per iteration, which can be unacceptably slow with large or complicated models. If AMWG is used for adaptation, then the final, non-adaptive algorithm should be MWG.

Bai (2009) extended AMWG to the Adaptive Directional Metropolis-within-Gibbs (ADMG) algorithm.

References

  • Bai Y (2009). “An Adaptive Directional Metropolis-within-Gibbs Algorithm.” Technical Report, Department of Statistics, University of Toronto.
  • Roberts G, Rosenthal J (2007). “Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms.” Journal of Applied Probability, 44, 458-475.
  • Roberts G, Rosenthal J (2009). “Examples of Adaptive MCMC.” Journal of Computational and Graphical Statistics, 18(2), 349-367.
  • Rosenthal J (2007). “AMCMC: An R Interface for Adaptive MCMC.” Computational Statistics and Data Analysis, 51(12), 5467-5470.


Componentwise Hit-And-Run Metropolis

(CHARM)


  • Acceptance Rate: The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm.
  • Difficulty: This algorithm is easy for a beginner. There are no algorithm specifications (although one is optional), and tuning is unnecessary.
  • Final Algorithm?: Yes.
  • Proposal: Componentwise.


The Hit-And-Run algorithm is a variation of RWM that has been around as long as Gibbs sampling. Hit-And-Run randomly samples a direction on the unit sphere, and a proposal is made for each parameter in its randomly-selected direction for a uniformly-distributed distance (Gilks and Roberts, 1996). This version of Hit-And-Run, called Componentwise Hit-And-Run Metropolis (CHARM), includes componentwise proposals and a Metropolis step for rejection. Introduced by Turchin (1971) along with Gibbs sampling, and popularized by Smith (1984), Hit-And-Run was given its name later due to its ability to run across the state-space and arrive at a distant “hit-point”. It is related to other algorithms with interesting names, such as Hide-And-Seek and Shake-And-Bake.

CHARM has one optional algorithm specification: alpha.star. When Specs=NULL, CHARM is non-adaptive. Otherwise, alpha.star is the target acceptance rate, and is recommended to be 0.44. When a target acceptance rate is declared, CHARM applies the Robbins-Monro stochastic approximation of Garthwaite et al. (2010) to the proposal distance to attain the target acceptance rate.
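
Hedged examples of both forms, with the usual Model, MyData, and Initial.Values assumed:

```r
# Non-adaptive CHARM, suitable as a final algorithm:
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="CHARM", Specs=NULL)
# Adaptive CHARM, targeting a 44% acceptance rate via Robbins-Monro:
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="CHARM", Specs=list(alpha.star=0.44))
```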

As a componentwise algorithm, the model is evaluated after a proposal is made for each parameter, which results in an algorithm that takes more time per iteration. As with the Metropolis-within-Gibbs (MWG) family, the time to complete each iteration grows with the number of parameters. Compared to other algorithms with multivariate proposals, a disadvantage is the time to complete each iteration increases as a function of parameters and model complexity. For example, in a 100-parameter model, CHARM completes its first iteration as HARM completes its 100th.

CHARM enjoys many of the advantages of HARM, such as having no tuning parameters (unless the adaptive form is used), traversing complex spaces with bounded sets in one iteration, not being adaptive (unless specified as adaptive), handling high correlations well, and having the potential to work well with multimodal distributions. When non-adaptive, CHARM may be used as a final algorithm.

References

  • Garthwaite P, Fan Y, Sisson S (2010). “Adaptive Optimal Scaling of Metropolis-Hastings Algorithms Using the Robbins-Monro Process.”
  • Gilks W, Roberts G (1996). “Strategies for Improving MCMC.” In W Gilks, S Richardson, D Spiegelhalter (eds.), Markov Chain Monte Carlo in Practice, p. 89-114. Chapman & Hall, Boca Raton, FL.
  • Smith R (1984). “Efficient Monte Carlo Procedures for Generating Points Uniformly Distributed Over Bounded Regions.” Operations Research, 32, 1296-1308.
  • Turchin, VF (1971). “On the Computation of Multidimensional Integrals by the Monte Carlo Method”, Theory of Probability and its Applications, 16(4), 720-724.


Differential Evolution Markov Chain

(DEMC)


  • Acceptance Rate: The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm.
  • Difficulty: This algorithm is moderate in difficulty. It has few algorithm specifications that are easy to specify, but proficiency comes with practice. This algorithm requires an additional matrix of initial values, and the number of chains depends on the number of parameters.
  • Final Algorithm?: No.
  • Proposal: Multivariate.


The original Differential Evolution Markov Chain (DEMC) (Ter Braak 2006), referred to in the literature as DE-MC, included a Metropolis step on a genetic algorithm called Differential Evolution with multiple chains (per parameter), in which the chains learn from each other. It could be considered as parallel adaptive direction sampling (ADS) with the Gibbs sampling (Gibbs) step replaced by a Metropolis step, or as a non-parametric form of Random-Walk Metropolis (RWM). However, for a model with J parameters, the original DEMC required at least 2J chains, and often required as many as 20J chains. Hence, from 2J to 20J model evaluations were required per iteration, whereas a typical multivariate sampler such as Adaptive-Mixture Metropolis (AMM) requires one evaluation, or a componentwise sampler such as Adaptive Metropolis-within-Gibbs (AMWG) requires J evaluations per iteration.

The version included here was presented in Ter Braak and Vrugt (2008), in which the required number of chains is drastically reduced by adapting based on historical, thinned samples (the parallel direction move), and periodically using a snooker move instead. In the article, three chains were used to update as many as 50-100 parameters. Larger models may require blocks (as suggested in the article, though blocking is not included here) or more chains (see below). In testing here, a few 200-dimensional models have been solved with 5-10 chains.

DEMC has four algorithm specifications: Nc is the number of chains (at least 3), Z is a \(T \times J\) matrix or \(T \times J \times N_c\) array of \(T\) thinned samples for \(J\) parameters and \(N_c\) chains, gamma is a scale parameter, and w is the probability of a snooker move for each iteration. When gamma=NULL, the scale parameter defaults to the recommended 2.381204 / sqrt(2J), though for snooker moves, it is 2.381204 / sqrt(2) regardless of the algorithm specification. The default, recommended probability of a snooker move is w = 0.1.
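
A hedged first-update call (Z=NULL, so history is generated internally; for consecutive updates, pass Fit$Posterior1 as Z, as described below):

```r
# DEMC with three chains, the recommended default scale (gamma=NULL), and
# a 10% probability of a snooker move each iteration.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="DEMC", Specs=list(Nc=3, Z=NULL, gamma=NULL, w=0.1))
```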

The parallel direction move consists of a multivariate proposal for each chain in which two randomly selected previous iterations from two randomly selected other chains are differenced and scaled, and a small jitter, drawn from U(-0.001, 0.001) for each of the J parameters, is added. The snooker move differences these with another randomly selected (current) chain in the current iteration, and with a fixed scale. Another variation is to use snooker with past chain time-periods. The snooker move facilitates mode-jumping behavior for multimodal posteriors.
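
A self-contained toy sketch of the parallel direction move (dimensions and values are illustrative; the array layout follows the Z specification above):

```r
# For chain i: difference two randomly chosen past states from two
# randomly chosen other chains, scale by gamma, and add a small jitter.
set.seed(1)
J <- 3; Nc <- 5; T.hist <- 100; i <- 1
Z <- array(rnorm(T.hist * J * Nc), dim=c(T.hist, J, Nc))  # thinned history
x.curr <- rnorm(J)                                        # current state
gamma <- 2.381204 / sqrt(2 * J)                           # default scale
r <- sample(setdiff(1:Nc, i), 2)                          # two other chains
t12 <- sample(1:T.hist, 2, replace=TRUE)                  # two past draws
prop <- x.curr + gamma * (Z[t12[1], , r[1]] - Z[t12[2], , r[2]]) +
     runif(J, -0.001, 0.001)
```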

For the first update, Z is usually NULL. Internally, LaplacesDemon populates Z with GIV, and it is strongly recommended that PGF be specified by the user. As the sampler iterates, Z is used for adaptation, and elements of Z are replaced with thinned samples. A short, first run is recommended to obtain thinned samples, such as in Fit$Posterior1. For consecutive updates, this posterior matrix is used as Z.

In this implementation, samples are returned of the main chain, for which Initial.Values are specified. Samples for other chains (associated with PCIV) are not returned, but are used to assist the main chain. The authors note that too many chains can be problematic when an outlier chain exists. Here, samples of other chains are not returned, outlier or not. If an outlier chain exists, it simply does not help the main chain much and wastes computational resources, but does not negatively affect the main chain.

An attractive property is that DEMC does not require a proposal covariance matrix, but adapts instead based on past (thinned) samples. In the output of LaplacesDemon, the thinned samples are also stored in CovarDHis, though they are thinned samples, not the history of the diagonal of the covariance matrix. This is done so the absolute differences can be observed for diminishing adaptation. Another attractive property is that the chains may be parallelized, such as across CPUs, in the future, though the current version is not parallelized.

DEMC has been considered to perform like a version of Metropolis-within-Gibbs (MWG) that updates by chain, rather than by component. DEMC may be one form of a compromise between a one-evaluation multivariate proposal and J componentwise proposals. Since it is adaptive, DEMC is not recommended as a final algorithm.

References

  • Ter Braak C (2006). “A Markov Chain Monte Carlo Version of the Genetic Algorithm Differential Evolution: Easy Bayesian Computing for Real Parameter Spaces.” Statistics and Computing, 16, 239-249.
  • Ter Braak C, Vrugt J (2008). “Differential Evolution Markov Chain with Snooker Updater and Fewer Chains.” Statistics and Computing, 18(4), 435-446.


Delayed Rejection Adaptive Metropolis

(DRAM)


  • Acceptance Rate: The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm that is best suited to models with a small to medium number of parameters. The proposal covariance matrix must be solved, and this matrix grows with the number of parameters.
  • Difficulty: This algorithm is relatively easy for a beginner. It has few algorithm specifications, and these are easy to specify. However, since it is adaptive, the user must regard diminishing adaptation.
  • Final Algorithm?: User Discretion. The RWM algorithm is recommended as the final algorithm, though DRAM may be used if diminishing adaptation occurs and adaptation ceases effectively.
  • Proposal: Multivariate. Whenever a proposal is rejected, an alternate proposal is attempted.


The Delayed Rejection Adaptive Metropolis (DRAM) algorithm is merely the combination of both Delayed Rejection Metropolis (DRM) and Adaptive Metropolis (AM) (Haario, Laine, Mira, and Saksman 2006). DRAM has been demonstrated to be robust in extreme situations where DRM or AM fail separately. Haario et al. (2006) present an example involving ordinary differential equations in which least squares could not find a stable solution, and DRAM did well.

DRAM has two algorithm specifications: Adaptive is the iteration in which DRAM becomes adaptive, and Periodicity is the frequency in iterations of adaptation.
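
A hedged example call:

```r
# DRAM: adaptation begins at iteration 500 and recurs every 10 iterations;
# rejected proposals trigger the delayed-rejection attempt automatically.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Covar=NULL, Iterations=10000, Status=100, Thinning=10,
     Algorithm="DRAM", Specs=list(Adaptive=500, Periodicity=10))
```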

The DRAM algorithm is useful to assist AM when the acceptance rate is low. As an alternative, the Adaptive-Mixture Metropolis (AMM) is an extension of the AM algorithm that includes a mixture of proposals, and one mixture component has a small proposal standard deviation to assist in overcoming initially low acceptance rates. If DRAM is used for adaptation, then the final algorithm should be Random-Walk Metropolis (RWM).

References

  • Haario H, Laine M, Mira A, Saksman E (2006). “DRAM: Efficient Adaptive MCMC.” Statistics and Computing, 16, 339-354.


Delayed Rejection Metropolis

(DRM)


  • Acceptance Rate: The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
  • Applications: This is a widely applicable, general-purpose algorithm. The proposal covariance matrix should have been solved with an adaptive algorithm such as DRAM.
  • Difficulty: This algorithm is easy for a beginner when the proposal covariance has been tuned with another algorithm. Otherwise, it may be tedious for the user to tune the proposal covariance matrix.
  • Final Algorithm?: Yes.
  • Proposal: Multivariate. Whenever a proposal is rejected, an alternate proposal is attempted.


The Delayed Rejection Metropolis (DRM or DR) algorithm is a Random-Walk Metropolis (RWM) with one, small twist. Whenever a proposal is rejected, the DRM algorithm will try one or more alternate proposals, and correct for the probability of this conditional acceptance. By delaying rejection, autocorrelation in the chains may be decreased, and the algorithm is encouraged to move. The additional calculations may slow each iteration of the algorithm in which the first set of proposals is rejected, but it may also converge faster. For more information on DRM, see Mira (2001).

DRM does not have any algorithm specifications.

DRM may be considered to be an adaptive MCMC algorithm, because it adapts the proposal based on a rejection. However, DRM does not violate the Markov property, because the proposal is based on the current state. DRM is not considered to be an adaptive MCMC algorithm here, because it is not adapting to the target distribution by considering previous states in the Markov chain, but merely makes more attempts from the current state.

LaplacesDemon also temporarily shrinks the proposal covariance arbitrarily by 50% for delayed rejection. A smaller proposal covariance is more likely to be accepted, and the goal of delayed rejection is to increase acceptance. In the long-term, a proposal covariance that is too small is undesirable, and so it is only used in this case to assist acceptance.

Since DRM is non-adaptive, it is suitable as a final algorithm.

References

  • Mira A (2001). “On Metropolis-Hastings Algorithms with Delayed Rejection.” Metron, LIX(3–4), 231-241.


Elliptical Slice Sampler

(ESS)


  • Acceptance Rate: The acceptance rate is 1.
  • Applications: This algorithm is applicable only to models in which the prior mean of all parameters is zero.
  • Difficulty: This algorithm is easy for a beginner. There is only one optional algorithm specification for block updating, and tuning is unnecessary, though optimal performance is gained with a user-specified prior covariance matrix for the ellipse.
  • Final Algorithm?: Yes.
  • Proposal: Multivariate.


Elliptical Slice Sampling (ESS) was introduced by Murray et al. (2010) as a multivariate extension of componentwise Slice sampling (Slice) that utilizes ellipse ν and angle θ. Each iteration, ellipse ν is drawn from N(0, Σ), where Σ is a user-specified prior covariance for the ellipse. The authors recommend using a form of the covariance of the data. Even though other parameters are not discussed, an identity matrix, or a combination thereof, performs well in practice. The prior covariance matrix is accepted as VarCov in the LaplacesDemon function, rather than the usual proposal covariance matrix.

ESS has one algorithm specification B for blocks of parameters, and defaults to B=NULL. For large-dimensional problems, block updating may be used. To specify block updating, the user gives the B specification a list in which each component is a block, and each component is a vector of integers pointing to the position of the parameters for that block. Block updating also requires the Covar argument to receive a list of prior covariance matrices of the appropriate dimension.
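
A hedged example without blocking, in which Covar receives the prior covariance of the ellipse (an identity matrix here, as suggested above) rather than a proposal covariance matrix:

```r
# ESS with no blocks; the Covar argument is the prior covariance Sigma.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Covar=diag(length(Initial.Values)),
     Iterations=10000, Status=100, Thinning=10,
     Algorithm="ESS", Specs=list(B=NULL))
```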

Each proposal is an additive, trigonometric function of the current state, ellipse ν, and angle θ. Angle θ is originally in the interval (0, 2π] each iteration, and is shrunk until acceptance occurs. This results in a 100% acceptance rate, and usually multiple model evaluations per iteration.
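
The shrinking procedure on the ellipse can be sketched in a few lines of base R, assuming a zero-mean Gaussian prior with covariance Sigma and a generic log-likelihood `loglik` (hypothetical stand-ins; ESS targets the product of the likelihood and the Gaussian prior):

```r
loglik <- function(f) sum(dnorm(c(1, -0.5), f, 1, log = TRUE))  # toy log-likelihood

ess_step <- function(f, Sigma, loglik) {
  nu   <- drop(t(chol(Sigma)) %*% rnorm(length(f)))  # ellipse nu ~ N(0, Sigma)
  logy <- loglik(f) + log(runif(1))                  # slice threshold
  theta <- runif(1, 0, 2 * pi)                       # initial angle in (0, 2*pi]
  lo <- theta - 2 * pi; hi <- theta                  # shrinking bracket
  repeat {
    fp <- f * cos(theta) + nu * sin(theta)           # additive, trigonometric proposal
    if (loglik(fp) > logy) return(fp)                # on the slice: always accepted eventually
    if (theta < 0) lo <- theta else hi <- theta      # shrink the bracket toward theta = 0
    theta <- runif(1, lo, hi)
  }
}

f <- c(0, 0)
for (i in 1:1000) f <- ess_step(f, Sigma = diag(2), loglik = loglik)
```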

This algorithm is applicable only to models in which the prior mean of all parameters is zero. It is often possible to apply ESS when the variables are centered and scaled.

An advantage of ESS over the (componentwise) Slice algorithm is the computational efficiency of a multivariate proposal. A disadvantage is that the user must specify the prior covariance Σ for optimal ESS performance, and the algorithm can be very sensitive to this prior. Since ESS is not adaptive, it is suitable as a final algorithm.

References

  • Murray I, Adams R, MacKay D (2010). “Elliptical Slice Sampling.” Journal of Machine Learning Research, 9, 541-548.

See Also

Griddy-Gibbs

(GG)


Aspect Description
Acceptance Rate The acceptance rate is 1.
Applications This is a widely applicable, general-purpose algorithm.
Difficulty This algorithm is moderate in difficulty because the user-specified grid may need tuning. With an appropriate grid, the method is fully automatic. There is only one required algorithm specification for defining the grid. Additional optional algorithm specifications control parallelization or allow discrete parameters.
Final Algorithm? Yes.
Proposal Componentwise.


Introduced by Ritter and Tanner (1992), the Griddy-Gibbs (GG) sampler is a componentwise algorithm that approximates the conditional distribution by evaluating the model at a discrete set of points, the user-specified grid for each parameter. The proposal is a random draw from the approximated distribution. In this implementation, spline interpolation is used to approximate the distribution for continuous parameters with 1,000 points, given the evaluated points. The acceptance rate is 100%.

GG has five algorithm specifications: Grid is a vector or list of evenly-spaced points in the grid, dparm is a vector that indicates discrete parameters and defaults to NULL, CPUs is the number of available central processing units (CPUs), Packages is a vector of package names, and Dyn.lib is a vector of shared libraries. The default for Grid is seq(from=-0.1, to=0.1, len=5), which creates a grid with the values -0.1, -0.05, 0, 0.05, and 0.1. For each continuous parameter in each iteration, the grid values are added to the latest value of the parameter, and the model is evaluated with the parameter at each point on the grid. For each discrete parameter, the model is evaluated at each grid point and a sample is taken.
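
A minimal sketch of one Griddy-Gibbs update for a continuous parameter, assuming a generic log-posterior `logpost` (an illustrative stand-in, not the package internals), with the default grid and a spline refinement as described above:

```r
logpost <- function(theta) sum(dnorm(theta, 0, 1, log = TRUE))  # toy target

gg_update <- function(theta, j, Grid, logpost) {
  pts <- theta[j] + Grid                                # grid around the current value
  lp  <- sapply(pts, function(g) { th <- theta; th[j] <- g; logpost(th) })
  fine <- spline(pts, lp, n = 1000)                     # interpolate the log-density
  w <- exp(fine$y - max(fine$y))                        # unnormalized weights
  theta[j] <- sample(fine$x, 1, prob = w / sum(w))      # draw from the approximation
  theta
}

theta <- c(0.5, -0.5)
Grid <- seq(from = -0.1, to = 0.1, len = 5)             # the default grid
for (i in 1:1000) for (j in 1:2) theta <- gg_update(theta, j, Grid, logpost)
```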

At least five grid points are recommended for a continuous parameter, and a grid with more points will better approximate the distribution, but requires more evaluations. However, the approximation in GG may be parallelized (within LaplacesDemon, not LaplacesDemon.hpc), so a large computing environment may approach an excellent approximation with little inefficiency. It is natural to desire a grid with a larger range, but in practice this becomes problematic, so it is recommended to keep the range of the grid relatively small, say within [-0.1,0.1] or [-0.2,0.2]; finding a good range may require experimentation. After observing a Markov chain, the user may adjust the range of the grid to decrease autocorrelation in a future update.

When an odd number of grid points is used for a continuous parameter, the current position is evaluated. If there are too few grid points, then the current point may be the only point with appreciable density, and the parameter may never move. This can be checked afterward with the AcceptanceRate function. If the acceptance rate is less than one for any parameter, then there are too few grid points. This failure may be harder to find when there are numerous parameters, an even number of grid points, and too few grid points, because the acceptance rate may be 100% for a parameter, yet it may be oscillating between two values.

GG is one of the slower algorithms per iteration, since the model must be evaluated multiple times per parameter per iteration. This may mostly be alleviated in a parallel environment. GG seems appropriate when the problem must be solved with a componentwise algorithm, and excessive thinning is required. GG may help reduce thinning by making proposals from the approximated conditional distribution, though parameter correlation may increase autocorrelation.

Advantages of parallelized GG over most componentwise algorithms are that it yields a 100% acceptance rate, and that draws from the approximated distribution should be less autocorrelated. A disadvantage is that more model evaluations are required, and even if a parallel environment had zero overhead, GG would be twice as slow per iteration as other componentwise algorithms. However, if the user tunes the grid, it may be more efficient than other componentwise algorithms. The Adaptive Griddy-Gibbs (AGG) sampler is available so the user can avoid tuning the grid. Since GG is not adaptive, it is suitable as a final algorithm.

References

  • Ritter C, Tanner M (1992). “Facilitating the Gibbs Sampler: the Gibbs Stopper and the Griddy-Gibbs Sampler.” Journal of the American Statistical Association, 87, 861-868.

See Also

Gibbs Sampler

(Gibbs)


Aspect Description
Acceptance Rate The acceptance rate is 1.
Applications This is a widely applicable, general-purpose algorithm that generally requires conjugacy.
Difficulty This algorithm is difficult for a beginner when the conditional distribution must be specified. Otherwise, it is fully automatic.
Final Algorithm? Yes.
Proposal Componentwise.


Gibbs sampling was introduced by Turchin (1971), and later by brothers Geman and Geman (1984). The Geman brothers named the algorithm after the physicist J. W. Gibbs, some eight decades after his death, in reference to an analogy between the sampling algorithm and statistical physics. Geman and Geman introduced Gibbs sampling in the context of image restoration.

Gibbs has two algorithm specifications: FC accepts a user-specified function to calculate full-conditionals, and MWG defaults to NULL and otherwise accepts a numeric vector that indicates which parameters are updated with Metropolis-within-Gibbs. FC accepts two arguments, parm and Data, just like the Model specification function, and returns the full vector of parameters parm. Each iteration, the full-conditional distributions are completed first, then MWG updates, if any.

In its basic version, Gibbs sampling is a special case of the Metropolis-Hastings algorithm. However, in its extended versions, Gibbs sampling can be considered a general framework for sampling from a large set of variables by sampling each variable (or in some cases, each group of variables) in turn, and can incorporate the Metropolis-Hastings algorithm (such as Metropolis-within-Gibbs or similar methods such as Slice sampling) to implement one or more of the sampling steps. Currently, Metropolis-within-Gibbs is included.

Gibbs sampling is applicable when the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional distribution of each variable is known and easy to sample from. A Gibbs sampler generates a draw from the distribution of each parameter or variable in turn, conditional on the current values of the other parameters or variables. Therefore, a Gibbs sampler is a componentwise algorithm. As a simplified example, given a joint probability distribution \(p(θ_1, θ_2 | y)\), a Gibbs sampler would draw \(p(θ_1 | θ_2, y)\), then \(p(θ_2 | θ_1, y)\).
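
For instance, if the posterior is bivariate standard normal with correlation ρ, both full conditionals are known normals, so each Gibbs scan is two exact draws with no accept/reject step (a toy worked example, not package code):

```r
rho <- 0.8
theta <- matrix(0, 5000, 2)
for (i in 2:5000) {
  theta[i, 1] <- rnorm(1, rho * theta[i - 1, 2], sqrt(1 - rho^2))  # p(theta_1 | theta_2, y)
  theta[i, 2] <- rnorm(1, rho * theta[i, 1],     sqrt(1 - rho^2))  # p(theta_2 | theta_1, y)
}
```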

For a user to determine each conditional distribution, the joint distribution must be known first, then all parameters are held constant except the current parameter of interest, and the equation is simplified to the conditional distribution.

There are numerous variations of Gibbs sampling, such as blocked Gibbs sampling, collapsed Gibbs sampling, and ordered overrelaxation.

The advantage of Gibbs sampling over other MCMC algorithms is that it is often more efficient when it is appropriate, due to a 100% acceptance rate. The disadvantages are that a Gibbs sampler is appropriate only with conjugate distributions and low correlations between parameters, and therefore Gibbs sampling is less generally applicable than other MCMC algorithms. Since Gibbs is not adaptive, it is suitable as a final algorithm.

References

  • Geman S and Geman D (1984). “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721-741.
  • Turchin, VF (1971). “On the Computation of Multidimensional Integrals by the Monte Carlo Method”, Theory of Probability and its Applications, 16(4), 720-724.

See Also

Hit-And-Run Metropolis

(HARM)


Aspect Description
Acceptance Rate The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications This is a widely applicable, general-purpose algorithm.
Difficulty This algorithm is easy for a beginner. There are no required algorithm specifications (though two are optional) and tuning is unnecessary.
Final Algorithm? Yes, if not adaptive.
Proposal Blockwise or Multivariate. Proposals are multivariate only in the sense that proposals for multiple parameters are generated at once. However, proposals are not generated with a multivariate distribution and a proposal covariance matrix.


The Hit-And-Run algorithm is a variation of Random-Walk Metropolis (RWM) that has been around as long as Gibbs sampling (Gibbs). Hit-And-Run randomly samples a direction on the unit sphere as in Gilks and Roberts (1996), and a proposal is made for each parameter in its randomly-selected direction for a uniformly-distributed distance. This version of Hit-And-Run, called Hit-And-Run Metropolis (HARM), includes multivariate proposals and a Metropolis step for rejection. Introduced by Turchin (1971), along with the original Gibbs sampler, and popularized by Smith (1984), Hit-And-Run was given its name later due to its ability to run across the state-space and arrive at a distant “hit-point”. It is related to other algorithms with interesting names, such as Hide-And-Seek and Shake-And-Bake.

HARM has two optional algorithm specifications: alpha.star is the target acceptance rate, and B is a list of blocked parameters. When alpha.star=NULL, non-adaptive HARM is used. Otherwise, the Robbins-Monro stochastic approximation of Garthwaite et al. (2010) is applied to the proposal distance. alpha.star=0.234 is recommended. For blockwise sampling, each component of the list is a block and consists of a vector indicating the position of the parameters per block.

HARM is the fastest MCMC algorithm per iteration in the LaplacesDemon package, because it is very simple. For example, HARM does not use a proposal covariance matrix, and there are no tuning parameters, with one optional exception (the adaptive proposal distance discussed above). The proposal is multivariate in the sense that all parameters are proposed at once, though from univariate draws. HARM often mixes better than the Gibbs sampler (Gilks and Roberts 1996), especially with correlated parameters (Chen and Schmeiser 1992).
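
The core update is short enough to sketch in base R, assuming a generic log-posterior `logpost`; the uniform distance on (0,1) is a simplifying assumption rather than the package's exact (and optionally adaptive) choice:

```r
logpost <- function(theta) sum(dnorm(theta, 0, 1, log = TRUE))  # toy target

harm_step <- function(theta, logpost) {
  z <- rnorm(length(theta))
  d <- z / sqrt(sum(z^2))                  # random direction on the unit sphere
  prop <- theta + runif(1) * d             # uniform distance along that direction
  if (log(runif(1)) < logpost(prop) - logpost(theta)) prop else theta  # Metropolis step
}

theta <- c(2, -2)
for (i in 1:5000) theta <- harm_step(theta, logpost)
```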

Adaptive HAR (without the Metropolis step) with a multivariate proposal is available in the LaplaceApproximation function.

The HARM algorithm is able to traverse complex spaces with bounded sets in one iteration. As such, HARM may work well with multimodal posteriors due to potentially good mode-switching behavior. However, HARM may have difficulty sampling regions of high probability that are spike-shaped or near constraints, and this difficulty is likely to be more problematic in high dimensions. When HARM is non-adaptive, it may be used as a final algorithm.

References

  • Chen M, Schmeiser B (1992). “Performance of the Gibbs, Hit-And-Run and Metropolis Samplers.” Journal of Computational and Graphical Statistics, 2, 251-272.
  • Garthwaite P, Fan Y, Sisson S (2010). “Adaptive Optimal Scaling of Metropolis-Hastings Algorithms Using the Robbins-Monro Process.”
  • Gilks W, Roberts G (1996). “Strategies for Improving MCMC.” In W Gilks, S Richardson, D Spiegelhalter (eds.), Markov Chain Monte Carlo in Practice, p. 89-114. Chapman & Hall, Boca Raton, FL.
  • Smith R (1984). “Efficient Monte Carlo Procedures for Generating Points Uniformly Distributed Over Bounded Regions.” Operations Research, 32, 1296-1308.
  • Turchin, VF (1971). “On the Computation of Multidimensional Integrals by the Monte Carlo Method”, Theory of Probability and its Applications, 16(4), 720-724.

See Also

Hamiltonian Monte Carlo

(HMC)


Aspect Description
Acceptance Rate The optimal acceptance rate is 65% when L > 1, or 57.4% when L = 1. The observed acceptance rate may be suitable in the interval [60%,70%] when L > 1, or [40%,80%] when L = 1.
Applications This is a widely applicable, general-purpose algorithm that is best suited to models with a small number of parameters. The number of model evaluations per iteration increases with the number of parameters.
Difficulty This algorithm is difficult for a beginner. It has three algorithm specifications, all require tuning, and tuning is difficult.
Final Algorithm? Yes.
Proposal Multivariate. Proposals are multivariate only in the sense that proposals for multiple parameters are generated at once. Each iteration involves numerous proposals, due to partial derivatives and L.


Introduced under the name of hybrid Monte Carlo (Duane, Kennedy, Pendleton, and Roweth 1987), the name Hamiltonian Monte Carlo (HMC) surpasses it in popularity in the statistics literature. HMC introduces auxiliary momentum variables with independent, Gaussian proposals. Momentum variables receive alternating updates: simple updates, in which momentum is resampled, and Metropolis updates. A Metropolis update proposes a new state by computing a trajectory according to Hamiltonian dynamics, from physics. Hamiltonian dynamics is discretized with the leapfrog method. In this way, distant jumps can be proposed and random-walk behavior avoided.

HMC has three algorithm specifications: a vector of the step sizes of the leapfrog steps, epsilon (ε), that is equal in length to the number of parameters, the number of leapfrog steps, L, and an optional mass matrix M. When L=1, HMC reduces to Langevin Monte Carlo (LMC), also called the Metropolis-Adjusted Langevin Algorithm (MALA), introduced by Rossky, Doll, and Friedman (1978). These tuning parameters must be adjusted until the acceptance rate is appropriate. The optimal acceptance rate of HMC is 65%; in the case of LMC or MALA, 57.4% is optimal. Tuning ε and L, however, is very difficult. The trajectory length, εL, must also be considered.
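
To make the leapfrog mechanics concrete, here is a minimal sketch of one HMC update with an identity mass matrix, assuming a generic log-posterior `logpost` and its gradient `grad` (hypothetical stand-ins; LaplacesDemon approximates the partial derivatives numerically):

```r
logpost <- function(q) -0.5 * sum(q^2)   # toy target: standard normal
grad    <- function(q) -q                # its gradient

hmc_step <- function(q, epsilon, L, logpost, grad) {
  p <- rnorm(length(q))                           # fresh Gaussian momentum
  q.new <- q
  p.new <- p + 0.5 * epsilon * grad(q)            # half step for momentum
  for (l in 1:L) {
    q.new <- q.new + epsilon * p.new              # full step for position
    if (l < L) p.new <- p.new + epsilon * grad(q.new)
  }
  p.new <- p.new + 0.5 * epsilon * grad(q.new)    # final half step for momentum
  # Metropolis step on the joint (position, momentum) energy:
  la <- logpost(q.new) - 0.5 * sum(p.new^2) - (logpost(q) - 0.5 * sum(p^2))
  if (log(runif(1)) < la) q.new else q
}

q <- c(3, 3)
for (i in 1:2000) q <- hmc_step(q, epsilon = 0.2, L = 10, logpost, grad)
```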

Suggestions for tuning ε and L are found in Neal (2011). When ε is too large, the algorithm becomes unstable and suffers from a low acceptance rate. When ε is too small, the algorithm takes too many small steps and is inefficient. When L is too large, trajectory lengths (εL) result in double-back behavior and become computationally self-defeating. When L is too small, more random-walk behavior occurs and mixing becomes slower. Even if ε and L are tuned optimally, a well-tuned mass matrix may be necessary for efficient sampling.

If a user is new to tuning HMC algorithms, then good advice may be to leave L = 1 and begin with small values for ε, say 0.1 or smaller. It is easy to experience problems when inexperienced, but HMC is a rewarding algorithm once proficiency is acquired. As can be expected, the adaptive extensions (AHMC, HMCDA, and NUTS) will also be easier, since ε is adapted and does not require tuning (and in the case of NUTS, \(L\) does not require tuning).

Although HMC generates multivariate proposals, parameter correlation is not taken into account unless a mass matrix is used. Some sources refer to a mass matrix in HMC as a covariance matrix, and some as a precision matrix. Here, the mass matrix is a covariance matrix. It is difficult to tune for several reasons. The optimal mass matrix is different with different configurations of the parameters, so what was optimal at one point is sub-optimal at another. It is not recommended to substitute the historical covariance matrix of samples while pursuing equilibrium.

Partial derivatives are required, and hence the parameters must be differentiable everywhere the algorithm explores. Partial derivatives are computationally intensive, and computational expense increases with the number of parameters. For \(K\) parameters and \(L\) leapfrog steps, there are \(L + KL\) evaluations of the model specification function per iteration. In practice, starting any of the algorithms in the HMC family (AHMC, HMC, HMCDA, or THMC) in a region that is distant from density may result in failure due to differentiation, unless manipulated with priors.

Since HMC requires the approximation of partial derivatives, it is slower per iteration than most algorithms, and much slower in higher dimensions. Tuned well, HMC is an excellent algorithm, but tuning can be very difficult. The AHMC algorithm and HMCDA are adaptive versions of HMC in which ε is adapted based on recent history of acceptance and rejection. The NUTS algorithm is a fully adaptive version that does not require tuning of ε or \(L\). Since HMC is not adaptive, it is suitable as a final algorithm.

References

  • Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987). “Hybrid Monte Carlo.” Physics Letters B, 195, 216-222.
  • Neal R (2011). “MCMC Using Hamiltonian Dynamics.” In S Brooks, A Gelman, G Jones, XL Meng (eds.), Handbook of Markov Chain Monte Carlo, p. 113-162. Chapman & Hall, Boca Raton, FL.
  • Rossky P, Doll J, Friedman H (1978). “Brownian Dynamics as Smart Monte Carlo Simulation.” Journal of Chemical Physics, 69, 4628-4633.

See Also

Hamiltonian Monte Carlo with Dual-Averaging

(HMCDA)


Aspect Description
Acceptance Rate The user specifies the target acceptance rate as delta (65% is recommended). A recommended, suitable, observed acceptance rate may be in the interval [delta-5%,delta+5%].
Applications This is a widely applicable, general-purpose algorithm that is best suited to models with a small number of parameters. The number of model evaluations per iteration increases with the number of parameters.
Difficulty This algorithm is relatively difficult for a beginner. It has several algorithm specifications. Since it is an adaptive algorithm, the user must regard diminishing adaptation.
Final Algorithm? Yes, when non-adaptive, otherwise: User Discretion.
Proposal Multivariate. Proposals are multivariate only in the sense that proposals for multiple parameters are generated at once. However, proposals are not generated with a multivariate distribution and a proposal covariance matrix. Each iteration involves numerous proposals, due to partial derivatives and L.


The Hamiltonian Monte Carlo with Dual-Averaging (HMCDA) algorithm is an extension of Hamiltonian Monte Carlo (HMC) that adapts the scalar step-size parameter, ε, according to dual-averaging. This is algorithm #5 in Hoffman and Gelman (2012).

HMCDA has five algorithm specifications: the number of adaptive iterations A, the target acceptance rate delta or δ (and 0.65 is recommended), a scalar step-size epsilon or ε, the maximum number of leapfrog steps Lmax, and the trajectory length lambda or λ. The trajectory length is scalar λ = εL, where \(L\) is the unspecified number of leapfrog steps that is determined from ε and λ. Each iteration i ≤ A, HMCDA adapts and coerces the target acceptance rate δ.

As with HMC, the HMCDA algorithm is slower per iteration than many other algorithms, but often produces chains with good mixing. HMCDA should outperform Adaptive Hamiltonian Monte Carlo (AHMC), and iterates faster as well, unless L becomes large. When mixing is inadequate, consider switching to the MALA or NUTS algorithm. When parameters are highly correlated, another algorithm should be preferred in which correlation is taken into account, such as AMM, or in which the algorithm is generally invariant to correlation, such as twalk. When adaptive, it is not suitable as a final algorithm.

References

  • Hoffman M, Gelman A (2012). “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research, 1-30.

See Also

Independence Metropolis

(IM)


Aspect Description
Acceptance Rate The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications This is a widely applicable, general-purpose algorithm that is best suited to models with a small to medium number of parameters. The proposal covariance matrix must be solved, and this matrix grows with the number of parameters.
Difficulty This algorithm is relatively easy for a beginner. It has few algorithm specifications, and these are easy to specify.
Final Algorithm? Yes.
Proposal Multivariate.


Proposed by Hastings (1970) and popularized by Tierney (1994), the Independence Metropolis (IM) algorithm (also called the independence sampler) is an algorithm in which the proposal distribution does not depend on the previous state or iteration. The proposal distribution must be a good approximation of the target distribution for IM to perform well, and the proposal distribution should have slightly heavier tails than the target distribution. IM is used most often to obtain additional posterior samples given an algorithm that has already converged.

IM has one algorithm specification: mu. The usual approach to IM is to update the model with Laplace Approximation, and then supply the posterior mode and covariance to IM. The posterior mode vector of Laplace Approximation becomes the mu argument in the algorithm specifications for IM. The covariance matrix from Laplace Approximation is expanded by multiplying it by 1.1 so that it has heavier tails. Each iteration, IM draws from a multivariate normal distribution as the proposal distribution. Alternatively, posterior means and covariances may be used from other algorithms, such as other MCMC algorithms.
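
A minimal sketch of the Independence Metropolis update, assuming the fixed proposal mean `mu` and covariance `S` came from a previous fit as described above (illustrative stand-ins), with the Hastings correction for the asymmetric proposal:

```r
logpost <- function(theta) sum(dnorm(theta, 0, 1, log = TRUE))  # toy target

ldmvn <- function(x, mu, S) {                         # log multivariate normal density
  R <- chol(S)
  z <- backsolve(R, x - mu, transpose = TRUE)
  -0.5 * sum(z^2) - sum(log(diag(R))) - 0.5 * length(x) * log(2 * pi)
}

im_step <- function(theta, mu, S, logpost) {
  prop <- drop(mu + t(chol(S)) %*% rnorm(length(mu))) # draw from the fixed proposal
  la <- logpost(prop) - logpost(theta) +
        ldmvn(theta, mu, S) - ldmvn(prop, mu, S)      # Hastings correction
  if (log(runif(1)) < la) prop else theta
}

mu <- c(0, 0)
S  <- 1.1 * diag(2)                                   # inflated for heavier tails
theta <- mu
for (i in 1:5000) theta <- im_step(theta, mu, S, logpost)
```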

Since IM is non-adaptive and uses a proposal distribution that remains fixed for all iterations, it may be used as a final algorithm.

References

  • Hastings, W.K. (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications.” Biometrika, 57(1), 97-109.
  • Tierney, L. (1994). “Markov Chains for Exploring Posterior Distributions.” The Annals of Statistics, 22(4), 1701-1762.

See Also

  • none

Interchain Adaptation

(INCA)


Aspect Description
Acceptance Rate The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications This is a widely applicable, general-purpose algorithm for parallel chains only. It is best suited to models with a small to medium number of parameters. The proposal covariance matrix must be solved, and this matrix grows with the number of parameters.
Difficulty This algorithm is relatively easy for a beginner. It has few algorithm specifications, and these are easy to specify. However, since it is adaptive, the user must regard diminishing adaptation.
Final Algorithm? User Discretion. The RWM algorithm is recommended as the final algorithm, though INCA may be used if diminishing adaptation occurs and adaptation ceases effectively.
Proposal Multivariate.


The Interchain Adaptation (INCA) algorithm of Craiu et al. (2009) is an extension of the Adaptive Metropolis (AM) algorithm of Haario et al. (2001). Craiu et al. (2009) refer to INCA as inter-chain adaptation and inter-chain adaptive MCMC. INCA uses parallel chains that are independent, except that they share the adaptive component, and this sharing speeds convergence. Since parallel chains are a defining feature of INCA, in the LaplacesDemon package this algorithm works only with the function for high performance computing, LaplacesDemon.hpc, not LaplacesDemon.

INCA has two algorithm specifications: Adaptive is the iteration in which it becomes adaptive, and Periodicity is the frequency in iterations of adaptation.

As with AM, the proposal covariance matrix is set equal to the historical (or sample) covariance matrix. Ample pre-adaptive iterations are recommended (Craiu et al., 2009), and initial values should be dispersed to aid in discovering multimodal marginal posterior distributions. After adaptation begins, INCA combines the historical covariance matrices from all parallel chains during each adaptation. Each chain learns from experience as in AM, and in INCA, each chain also learns from the other parallel chains.

Solonen et al. (2012) found a dramatic reduction in the number of iterations to convergence when INCA used 10 parallel chains, compared against a single-chain AM algorithm. Similar improvements have been noted in the LaplacesDemon package, though more time is required per iteration.

The Gelman.Diagnostic is recommended by Craiu et al. (2009) to determine when the parallel chains have stopped sharing different information about the target distributions. The exchange of information between chains decreases as the number of iterations increases.

This implementation of INCA uses a server function that is built into LaplacesDemon.hpc. If the connection to this server fails, then the user must kill the process and then close all open connections with the closeAllConnections function.

Since INCA is an adaptive algorithm, the final algorithm should be Random-Walk Metropolis (RWM).

References

  • Craiu R, Rosenthal J, Yang C (2009). “Learn From Thy Neighbor: Parallel-Chain and Regional Adaptive MCMC.” Journal of the American Statistical Association, 104(488), 1454-1466.
  • Haario H, Saksman E, Tamminen J (2001). “An Adaptive Metropolis Algorithm.” Bernoulli, 7(2), 223-242.
  • Solonen A, Ollinaho P, Laine M, Haario H, Tamminen J, Jarvinen H (2012). “Efficient MCMC for Climate Model Parameter Estimation: Parallel Adaptive Chains and Early Rejection.” Bayesian Analysis, 7(2), 1-22.

See Also

Metropolis-Adjusted Langevin Algorithm

(MALA)


Aspect Description
Acceptance Rate The optimal acceptance rate is 57.4%. The observed acceptance rate may be suitable in the interval [40%,80%].
Applications This is a widely applicable, general-purpose algorithm. The number of model evaluations per iteration increases with the number of parameters.
Difficulty This algorithm is easy for a beginner to use. Although it has five algorithm specifications, it is fully automatic at the default values.
Final Algorithm? Yes, when not adaptive.
Proposal Multivariate.


Also called Langevin Monte Carlo (LMC), the Metropolis-Adjusted Langevin Algorithm (MALA) was proposed in Roberts and Tweedie (1996), an adaptive version in Atchade (2006), and an alternative adaptive version in Shaby and Wells (2010). MALA was inspired by stochastic models of molecular dynamics. MALA is an extension of the multivariate random-walk Metropolis (RWM) algorithm that includes partial derivatives to improve mixing. Roberts and Tweedie (1996) presented ULA, MALA, and MALTA, and recommended MALTA. MALTA is a refinement of MALA that uses a truncated drift, where the drift parameter is a step-size parameter for the partial derivatives. The non-adaptive version of MALA is nearly equivalent to the Hamiltonian Monte Carlo (HMC) algorithm with L=1 leapfrog steps, except MALA also includes a proposal covariance matrix.
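
Before turning to the adaptive version, here is a minimal sketch of the non-adaptive MALA update described above, assuming a generic log-posterior `logpost`, its gradient `grad`, a scalar step size, and an identity proposal covariance (all illustrative assumptions, not the package internals):

```r
logpost <- function(x) -0.5 * sum(x^2)   # toy target: standard normal
grad    <- function(x) -x

mala_step <- function(x, epsilon, logpost, grad) {
  mu.x <- x + 0.5 * epsilon^2 * grad(x)              # drift toward higher density
  y <- mu.x + epsilon * rnorm(length(x))             # Langevin proposal
  mu.y <- y + 0.5 * epsilon^2 * grad(y)
  logq.xy <- -sum((y - mu.x)^2) / (2 * epsilon^2)    # log q(x -> y), up to a constant
  logq.yx <- -sum((x - mu.y)^2) / (2 * epsilon^2)    # log q(y -> x)
  if (log(runif(1)) < logpost(y) - logpost(x) + logq.yx - logq.xy) y else x
}

x <- c(3, -3)
for (i in 1:5000) x <- mala_step(x, epsilon = 0.5, logpost, grad)
```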

The original, non-adaptive MALA family is difficult to tune, and optimal tuning differs between transience and stationarity. For this reason, the MALA presented here is the adaptive version presented in Atchade (2006). This adaptive MALA uses stochastic approximation to update an expectation, scale, and covariance every iteration. The scale of the proposal is adapted to obtain a target acceptance rate.

This version of MALA has five algorithm specifications: A defaults to 1e7 as the maximum acceptable value of the Euclidean norm of the adaptive parameters mu and sigma, and the Frobenius norm of the covariance matrix, alpha.star is the target acceptance rate and defaults to 57.4%, gamma defaults to 1 and accepts a constant in [0,T], where \(T\) is the number of iterations, and controls decay in adaptation, delta defaults to 1 and is a constant in the bounded drift function, and epsilon is a vector of length two that defaults to c(1e-6,1e-7), in which the first element is the acceptable minimum of adaptive scale sigma, and the second element is added to the diagonal of the covariance matrix for regularization. The expectation, scale, and covariance adapt each iteration, and the amount of adaptation decreases as gamma/t across the \(T\) iterations. When gamma=0, the algorithm does not adapt. Otherwise, gamma is the iteration through which full adaptation occurs and after which adaptation decays. The optimal acceptance rate is 57.4%, and is acceptable in the interval [40%, 80%].

MALA approximates partial derivatives for multivariate proposals. Approximating partial derivatives is computationally expensive, and requires \(J + 1\) model evaluations per \(J\) parameters per iteration. This results in a multivariate algorithm that is slightly slower per iteration than a traditional componentwise algorithm, though the higher acceptance rate and additional information from partial derivatives may allow it to mix better and approach convergence faster. Unlike most componentwise algorithms, this version of MALA accounts for parameter correlation.

When non-adaptive, MALA is suitable as a final algorithm.

References

  • Atchade Y (2006). “An Adaptive Version for the Metropolis Adjusted Langevin Algorithm with a Truncated Drift.” Methodology and Computing in Applied Probability, 8, 235-254.
  • Roberts G, Tweedie R (1996). “Exponential Convergence of Langevin Distributions and Their Discrete Approximations.” Bernoulli, 2(4), 341-363.
  • Shaby B, Wells M (2010). “Exploring an Adaptive Metropolis Algorithm.” Working Paper in Department of Statistical Science, Duke University.

See Also

Metropolis-Coupled Markov Chain Monte Carlo

(MCMCMC)


Aspect Description
Acceptance Rate The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications This is a widely applicable, general-purpose algorithm that specializes in representing multimodal distributions.
Difficulty This algorithm is moderately difficult for a beginner, because the proposal covariance must be tuned, the swap acceptance rate should be tuned, and the number of CPUs selected.
Final Algorithm? Yes.
Proposal Componentwise.


The Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC) algorithm – also referred to as \(MC^3\), \((MC)^3\), Metropolis-Coupled MCMC, Parallel Tempering, or Replica Exchange – was introduced by Geyer (1991) for multimodal distributions. The name “parallel tempering” was introduced in Earl and Deem (2005).

MCMCMC consists of parallel chains that use different temperatures and combine to form a mixture distribution, where each chain is a mixture component. The chains or mixture components are updated in parallel, each with a Metropolis step that is adjusted by temperature. After this first set of Metropolis steps, two parallel chains are selected at random, and another adjusted Metropolis step is used to accept or reject a swap between chains. The act of swapping is referred to as coupling. Each swap may be a jump between modes. MCMCMC has both within-chain and between-chain proposals each iteration, but only the coolest chain is retained.

MCMCMC has four algorithm specifications: lambda is either a positive-only scalar that controls the temperature spacing or a vector of temperature spacings, CPUs is the number of central processing units in the computer, Packages is a vector of any package names that are required, and Dyn.libs are any dynamic libraries that are required. A larger scalar value of λ results in greater differences in temperature between chains, and often a lower swap acceptance rate (see below). Given a scalar λ and m = 1,…,M CPUs, the current or proposal distribution is exponentiated to this power: 1/[1 + λ(m - 1)]. The number of CPUs must be at least two; as programmed, MCMCMC will not function on a single-core computer.
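
A minimal serial sketch of the tempering and swap mechanics follows (parallelization and proposal covariance tuning are omitted for brevity; the bimodal toy target, fixed proposal scale, and function names are illustrative assumptions):

```r
logpost <- function(x) log(0.5 * dnorm(x, -3, 0.5) + 0.5 * dnorm(x, 3, 0.5))  # bimodal toy

lambda <- 1; M <- 4
temp <- 1 / (1 + lambda * (0:(M - 1)))      # tempering powers; chain 1 is the coolest
x <- rep(0, M)
keep <- numeric(5000)
for (i in 1:5000) {
  for (m in 1:M) {                          # within-chain Metropolis steps
    prop <- rnorm(1, x[m], 1)
    if (log(runif(1)) < temp[m] * (logpost(prop) - logpost(x[m]))) x[m] <- prop
  }
  pair <- sample(M, 2)                      # propose a swap between two random chains
  la <- (temp[pair[1]] - temp[pair[2]]) * (logpost(x[pair[2]]) - logpost(x[pair[1]]))
  if (log(runif(1)) < la) x[pair] <- x[rev(pair)]
  keep[i] <- x[1]                           # only the coolest chain is retained
}
```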

MCMCMC must be called with LaplacesDemon, not LaplacesDemon.hpc, even though MCMCMC requires parallel chains. Each iteration, LaplacesDemon communicates with all CPUs, collects the latest state of each chain, and then the master process calculates the Metropolis steps for each chain, as well as the swap.

As with Random-Walk Metropolis (RWM), the proposal covariance matrix must be tuned. When nothing is known about this matrix, as is often the case, it is suggested to begin with an identity matrix with small-scale diagonal values, such as 1e-3. After a short run that hopefully has some acceptances, another short run can begin with the observed historical covariance matrix. Eventually, the proposal covariance matrix is usually satisfactory.

Atchade et al. (2011) demonstrate that MCMCMC performance improves as the acceptance rate of proposed swaps approaches 0.234. In LaplacesDemon, the swap acceptance rate is printed to the console at the end of each update. The swap acceptance rate is affected by the temperature spacing between parallel chains. If the default temperature spacing that results from a scalar λ is unacceptable, then a different scalar value may be attempted, or the user may enter their own temperature spacing directly as a vector for λ.

Coupling induces dependence among the chains, so each individual chain is no longer Markovian. The whole stochastic process of M chains together, however, does form a Markov chain.

Along with the J-walking algorithm of Frantz et al. (1990), MCMCMC was one of the first extensions of Metropolis-Hastings for multimodal distributions. Many additional MCMC algorithms, such as serial tempering or simulated tempering, were influenced by or variations of MCMCMC.

The advantage of MCMCMC over RWM is that MCMCMC is better able to approximate multimodal distributions, and that successful coupling (swaps) improves mixing. A disadvantage is that most of the information in the warmer chains is lost, speed per iteration is slower than RWM due to communication with CPUs, and distributions with many modes require at least as many CPUs. Since MCMCMC is not adaptive, it is suitable as a final algorithm.

References

  • Atchade YF, Roberts GO, Rosenthal JS (2011). “Towards Optimal Scaling of Metropolis-Coupled Markov Chain Monte Carlo.” Statistics and Computing, 21(4), 555-568.
  • Earl, DJ and Deem, MW (2005). “Parallel Tempering: Theory, Applications, and New Perspectives.” Physical Chemistry Chemical Physics, 7, 3910-3916.
  • Frantz DD, Freeman JD and Doll JL (1990). “Reducing Quasi-Ergodic Behavior in Monte Carlo Simulations by J-Walking: Applications to Atomic Clusters.” Journal of Chemical Physics, 93, 2769-2784.
  • Geyer CJ (1991). “Markov Chain Monte Carlo Maximum Likelihood.” In Keramidas, E.M. Computing Science and Statistics: Proceedings of the 23rd Symposium of the Interface. Fairfax Station VA: Interface Foundation. p. 156-163.

See Also

Multiple-Try Metropolis

(MTM)


Aspect Description
Acceptance Rate The optimal acceptance rate is above 44%, and increases with the number of tries. The observed acceptance rate may be suitable in the interval [10%,90%].
Applications This is a widely applicable, general-purpose algorithm.
Difficulty This algorithm is moderate in difficulty because the number of proposals and user-specified proposal variance may need tuning. Additional optional algorithm specifications control parallelization.
Final Algorithm? Yes.
Proposal Componentwise.


The Multiple-Try Metropolis (MTM) algorithm was introduced in Liu et al. (2000) as a componentwise Metropolis algorithm with an improved acceptance rate due to an increased proposal variance and number of proposals. The user specifies that K proposals will be made. For each parameter at each iteration, K normally-distributed proposals are made around the current parameter, scaled according to the user-specified proposal variance. Each proposal is weighted according to its unnormalized joint posterior density. One proposal is selected randomly, with probability proportional to the weights. A reference set of length K is created, in which the first K-1 elements are draws from a normal distribution centered around the selected proposal, again according to proposal variance. The last element, K, is the current value of the parameter. A Metropolis step is performed for the weighted reference set. If the weighted reference set is accepted, then the selected proposal becomes the new value for the parameter.
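
A minimal sketch of one such update for a scalar parameter, assuming a generic log-posterior `logpost` and the weight choice w(y|x) = π(y), which is valid for the symmetric normal proposal used here (illustrative assumptions, not the package internals):

```r
logpost <- function(x) dnorm(x, 0, 1, log = TRUE)    # toy target

logsumexp <- function(lw) max(lw) + log(sum(exp(lw - max(lw))))

mtm_step <- function(x, K, sd, logpost) {
  y  <- rnorm(K, x, sd)                              # K proposals around x
  wy <- sapply(y, logpost)
  j  <- sample(K, 1, prob = exp(wy - max(wy)))       # select one, weighted by density
  xs <- c(rnorm(K - 1, y[j], sd), x)                 # reference set; last element is x
  wx <- sapply(xs, logpost)
  if (log(runif(1)) < logsumexp(wy) - logsumexp(wx)) y[j] else x
}

x <- 2
for (i in 1:5000) x <- mtm_step(x, K = 5, sd = 1, logpost = logpost)
```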

MTM has four algorithm specifications: K is the number of proposals, CPUs is the number of central processing units, Packages accepts any package required for the model function, and Dyn.libs accepts dynamic libraries for parallelization, if required.

Liu et al. (2000) demonstrate the combination of MTM with the Snooker algorithm from Adaptive Directional Sampling (ADS), Conjugate Gradient Monte Carlo (CGMC), Hit-And-Run modified as Random-Ray Monte Carlo (RRMC), and Griddy-Gibbs (GG). MTM has since been extended to multivariate proposals, proposals with different scales, and more.

Advantages of MTM over Metropolis-within-Gibbs (MWG) are that the acceptance rate is higher, and multiple evaluations of the model specification function are parallelized each iteration. The advantage of MTM over Griddy-Gibbs (GG) is exact rather than approximate estimation, since an equilibrium distribution cannot be guaranteed from an approximation of the conditional, as in GG. A disadvantage of MTM compared to MWG is that MTM must evaluate the model specification function multiple times per parameter per iteration, resulting in an algorithm that is slower per iteration. Since MTM is not adaptive, it is suitable as a final algorithm.

References

  • Liu J, Liang F, and Wong W (2000). “The Multiple-Try Method and Local Optimization in Metropolis Sampling.” Journal of the American Statistical Association, 95, 121-134.

See Also

Metropolis-within-Gibbs

(MWG)


Aspect Description
Acceptance Rate The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications This is a widely applicable, general-purpose algorithm.
Difficulty This algorithm is easy for a beginner when the proposal variance has been tuned with another algorithm. Otherwise, it may be tedious for the user to tune the proposal variance.
Final Algorithm? Yes.
Proposal Componentwise.


Metropolis-within-Gibbs (MWG) is the original MCMC algorithm, introduced in Metropolis et al. (1953). Since it was the original MCMC algorithm, it pre-dated Gibbs sampling (Gibbs), and was not known as Metropolis-within-Gibbs. MWG was later proposed as a hybrid algorithm that combines Metropolis-Hastings and Gibbs sampling, and was suggested in Tierney (1994). The idea was to substitute a Metropolis step when Gibbs sampling fails. MWG is available in this respect in the Gibbs sampling algorithm. However, Gibbs sampling is not included in this particular version in LaplacesDemon (or most versions), making it an algorithm with a misleading name. Without Gibbs sampling, the more honest name would be componentwise random-walk Metropolis, but the more common name is MWG. MWG is also referred to as Metropolis within Gibbs, Metropolis-in-Gibbs, Random-Walk Metropolis-within-Gibbs, single-site Metropolis, the Metropolis algorithm, or Variable-at-a-Time Metropolis.

Componentwise

MWG is a componentwise algorithm, meaning that each parameter is updated individually each iteration. This implies that the model specification function is evaluated a number of times equal to the number of parameters, per iteration. Componentwise sampling usually ignores parameter correlation. The order of the parameters for updating is randomized each iteration (random-scan MWG), as opposed to sequential updating (deterministic-scan MWG). MWG often uses blocks, and blockwise sampling may be specified with the algorithm specification B, in which a list is supplied, and each list component consists of a vector of parameters. Blocks are updated sequentially, but the order of parameters is randomized within each block.

Metropolis Step

The update of each parameter occurs in a Metropolis step, otherwise called an accept/reject step or a Metropolis accept/reject step. A componentwise proposal is generated randomly and the model is evaluated with the proposed parameter. If the proposal is an improvement in the logarithm of the unnormalized joint posterior density, then the proposal is accepted. If the proposal is not an improvement, then the proposal is accepted or rejected according to a probability.
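
A minimal sketch of one random-scan MWG iteration, assuming a generic log-posterior `logpost` and a vector of tuned per-parameter proposal standard deviations (illustrative stand-ins, not the package internals):

```r
logpost <- function(theta) sum(dnorm(theta, 0, 1, log = TRUE))  # toy target

mwg_iteration <- function(theta, sd, logpost) {
  for (j in sample(length(theta))) {             # randomized update order (random-scan)
    prop <- theta
    prop[j] <- rnorm(1, theta[j], sd[j])         # componentwise random-walk proposal
    # Metropolis accept/reject step on the log unnormalized joint posterior:
    if (log(runif(1)) < logpost(prop) - logpost(theta)) theta <- prop
  }
  theta
}

theta <- c(3, -3)
for (i in 1:5000) theta <- mwg_iteration(theta, sd = c(1, 1), logpost = logpost)
```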

Acceptance Rate

The acceptance rate is the total number of proposals accepted in the Metropolis step divided by the total number of iterations. If the acceptance rate is too high, then the proposal variance is too small. In this case, the algorithm will take longer than necessary to find the target distribution and the samples will be highly autocorrelated. If the acceptance rate is too low, then the proposal variance is too large, and the algorithm is ineffective at exploration. In the worst case scenario, no proposals are accepted and the algorithm fails to move. Under theoretical conditions, the optimal acceptance rate for a sole, independent and identically distributed (IID), Gaussian, marginal posterior distribution is 0.44 or 44%. The optimal acceptance rate for an infinite number of distributions that are IID and Gaussian is 0.234 or 23.4%. Since MWG is a componentwise algorithm, it is most efficient when the acceptance rate of each parameter is 0.44.

Random-Walk Behavior

MWG is a componentwise Random-Walk Metropolis (RWM) algorithm. Random-walk behavior is desirable because it allows the algorithm to explore, and hopefully avoid getting trapped in undesirable regions. On the other hand, random-walk behavior is undesirable because it takes longer to converge to the target distribution while the algorithm explores. The algorithm generally progresses in the right direction, but may periodically wander away. Such exploration may uncover multimodal target distributions, which other algorithms may fail to recognize, and then converge incorrectly. With enough iterations, MWG is guaranteed theoretically to converge to the correct target distribution, regardless of the starting point of each parameter, provided the proposal variance for each proposal of a target distribution is sensible.

Historically, MWG first ran on an early computer that was built specifically for it, called MANIAC I. Metropolis discarded the first 16 iterations as burn-in, and updated an additional 48-64 iterations, which required 4-5 hours on MANIAC I.

Roberts and Rosenthal (2007) introduced an adaptive version of Metropolis-within-Gibbs, called Adaptive Metropolis-within-Gibbs (AMWG). Bai (2009) extended this to the Adaptive Directional Metropolis-within-Gibbs (ADMG) algorithm.

The advantage of MWG over the multivariate version, RWM, is that it is more efficient with information per iteration, so convergence is faster in iterations. The disadvantages of MWG are that covariance is not included in proposals, and it is more time-consuming due to the evaluation of the model specification function for each parameter per iteration. As the number of parameters increases, and especially as model complexity increases, the run-time per iteration increases. Since fewer iterations are completed in a given time-interval, the possible amount of thinning is also at a disadvantage. MWG is non-adaptive, and is suitable as a final algorithm.

References

  • Bai Y (2009). “An Adaptive Directional Metropolis-within-Gibbs Algorithm.” Technical Report in Department of Statistics at the University of Toronto.
  • Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953). “Equation of State Calculations by Fast Computing Machines.” Journal of Chemical Physics, 21, 1087-1092.
  • Roberts G, Rosenthal J (2007). “Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms.” Journal of Applied Probability, 44, 458-475.
  • Tierney L (1994). “Markov Chains for Exploring Posterior Distributions.” The Annals of Statistics, 22(4), 1701-1762. With discussion and a rejoinder by the author.

See Also

No-U-Turn Sampler

(NUTS)


Aspect Description
Acceptance Rate The user specifies the target acceptance rate as delta (60% is recommended). A recommended, suitable, observed acceptance rate may be in the interval [delta-5%,delta+5%].
Applications This is a widely applicable, general-purpose algorithm that is best suited to models with a small number of parameters. The number of model evaluations per iteration increases with the number of parameters.
Difficulty This algorithm is relatively easy for a beginner. It has only a few algorithm specifications. However, since it is adaptive, the user must regard diminishing adaptation.
Final Algorithm? Yes, if not adaptive, otherwise: User Discretion.
Proposal Multivariate. Proposals are multivariate only in the sense that proposals for multiple parameters are generated at once. However, proposals are not generated with a multivariate distribution and a proposal covariance matrix. Each iteration involves numerous proposals, due to partial derivatives and L.


The No-U-Turn Sampler (NUTS) is an extension of Hamiltonian Monte Carlo (HMC) that adapts both the scalar step-size ε and scalar number of leapfrog steps L. This is algorithm #6 in Hoffman and Gelman (2012).

NUTS has four algorithm specifications: the number of adaptive iterations A, the target acceptance rate delta or δ (and 0.6 is recommended), a scalar step-size epsilon or ε, and the maximum number of leapfrog steps Lmax.

Each iteration \(i ≤ A\), NUTS adapts both ε and \(L\), and coerces the target acceptance rate δ. \(L\) continues to change after adaptation ends, but is not an adaptive parameter in the sense of destroying ergodicity. The adaptive samples are discarded and only the thinned non-adaptive samples are returned.

The main advantage of NUTS over other HMC algorithms is that NUTS is the algorithm most likely to produce approximately independent samples, in the sense of low autocorrelation. Due to computational complexity, NUTS is slower per iteration than HMC, and the HMC family is among the slowest. Despite this, NUTS often produces chains with excellent mixing, and should outperform other adaptive versions of HMC, such as AHMC and HMCDA. Per iteration, NUTS should generally perform better than other HMC algorithms. Per minute, however, is another story.

NUTS has been extended elsewhere to allow for a non-diagonal mass matrix (proposal covariance matrix for momentum). This extension is not yet included here.

In complex and high-dimensional models, NUTS may produce approximately independent samples much more slowly in minutes than other MCMC algorithms, such as Adaptive Metropolis-within-Gibbs (AMWG). This is because the combination of calculating partial derivatives and the search each iteration for L is computationally intensive.

References

  • Hoffman M, Gelman A (2012). “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research, 1-30.

See Also

Oblique Hyperrectangle Slice Sampler

(OHSS)


Aspect Description
Acceptance Rate The acceptance rate is 1.
Applications This is a widely applicable, general-purpose algorithm.
Difficulty This algorithm is easy to use and does not require tuning.
Final Algorithm? Yes, when it is non-adaptive.
Proposal Multivariate.


Oblique Hyperrectangle Slice Sampler (OHSS) is an adaptive algorithm that approximates the sample covariance matrix. Introduced by Thompson (2011) where it is referred to as “Oblique Hyperrect”, OHSS is an extension of the hyperrectangle method in Neal (2003), which in turn is a multivariate extension of the slice sampler (Slice). The initial hyperrectangle is oriented to have edges that are parallel to the eigenvectors of the sample covariance matrix. The initial hyperrectangle is also scaled to have lengths proportional to the square roots of the eigenvalues of the sample covariance matrix. Rejection sampling is performed, and when a sample is rejected, the slice approximation is shrunk. Aside from specifying the initial hyperrectangle, this is the hyperrectangle method in Neal (2003).

OHSS has two algorithm specifications: A is the number of adaptive iterations, and n is the number of previous iterations, which defaults to zero. The number of previous iterations, if any, is used to weight the sample covariance matrix.

Each iteration, there is a 5% probability that a non-adaptive step is taken, rather than an adaptive step. The first ten iterations or so are non-adaptive. Eigenvectors are estimated no more than every tenth iteration. When adaptive, the sample covariance matrix is updated along with the eigenvectors.

Once the slice interval is estimated, a sample is drawn uniformly with rejection sampling from within this interval. If rejected, then the interval is shrunk until an acceptable sample is drawn. OHSS has a 100% acceptance rate.

The time per iteration varies, since rejection sampling often requires more than one evaluation. When OHSS is adaptive, it is not suitable as a final algorithm.

References

  • Thompson MD (2011). “Slice Sampling with Multivariate Steps.” http://hdl.handle.net/1807/31955
  • Neal R (2003). “Slice Sampling (with Discussion).” Annals of Statistics, 31(3), 705-767.

See Also

Preconditioned Crank-Nicolson

(pCN)


Aspect Description
Acceptance Rate The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications This is a widely applicable, general-purpose algorithm that is most useful with large-dimensional models.
Difficulty This algorithm is easy to use, having only one tuning parameter.
Final Algorithm? Yes.
Proposal Multivariate.


Preconditioned Crank-Nicolson (pCN) was introduced originally as the PIA algorithm in Beskos et al. (2008), and differs only slightly from Random-Walk Metropolis (RWM). The proposal is a first-order autoregressive process, rather than a centered random-walk. The pCN algorithm has an acceptance probability that is invariant to dimension, making pCN useful for large-dimensional models.

pCN has one algorithm specification, beta, which is the autoregressive parameter in (0,1).

As with RWM, the target acceptance rate is 23.4%. pCN seems to perform well only with an identity matrix as the proposal covariance matrix.
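
In the standard formulation, the target factors into a zero-mean Gaussian prior N(0, C) and a likelihood; because the autoregressive proposal is reversible with respect to that prior, the acceptance ratio involves only the likelihood. A minimal sketch under those assumptions, with a generic log-likelihood `loglik` (an illustrative stand-in):

```r
loglik <- function(theta) sum(dnorm(c(1, -0.5), theta, 1, log = TRUE))  # toy log-likelihood

pcn_step <- function(theta, beta, C.chol, loglik) {
  xi <- drop(t(C.chol) %*% rnorm(length(theta)))   # draw from the prior N(0, C)
  prop <- sqrt(1 - beta^2) * theta + beta * xi     # first-order autoregressive proposal
  if (log(runif(1)) < loglik(prop) - loglik(theta)) prop else theta
}

theta <- c(0, 0)
C.chol <- chol(diag(2))                            # identity prior covariance
for (i in 1:5000) theta <- pcn_step(theta, beta = 0.2, C.chol, loglik)
```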

Since pCN is not adaptive, it is suitable as a final algorithm.

References

  • Beskos A, Roberts GO, Stuart AM, Voss J (2008). “MCMC Methods for Diffusion Bridges.” Stoch. Dyn., 8, 319-350.

See Also

Robust Adaptive Metropolis

(RAM)


Aspect Description
Acceptance Rate The user specifies the target acceptance rate, a* (23.4% is recommended). The observed acceptance rate may be suitable in the interval [15%,50%].
Applications This is a widely applicable, general-purpose algorithm that is best suited to models with a small to medium number of parameters, or a larger number of blocked parameters. The proposal covariance matrix must be solved, and this matrix grows with the number of parameters.
Difficulty This algorithm is relatively easy for a beginner. It has few algorithm specifications, and these are easy to specify. However, since it is adaptive, the user must regard diminishing adaptation.
Final Algorithm? User Discretion. The RWM algorithm is recommended as the final algorithm, though RAM may be used if diminishing adaptation occurs and adaptation ceases effectively.
Proposal Blockwise or Multivariate.


The Adaptive Metropolis (AM) and Adaptive-Mixture Metropolis (AMM) algorithms adapt the scale of the proposal distribution to attain a theoretical acceptance rate. However, these algorithms are unable to adapt to the shape of the target distribution. The Robust Adaptive Metropolis (RAM) algorithm estimates the shape of the target distribution and simultaneously coerces the acceptance rate (Vihola 2011). If the acceptance probability, α, is less (or greater) than an acceptance rate target, α∗, then the proposal distribution is shrunk (or expanded). Matrix S is computed as a rank one Cholesky update. Therefore, the algorithm is computationally efficient up to a relatively high dimension. The AM and AMM algorithms require a burn-in period prior to adaptation, so that these algorithms can adapt to the sample covariance. RAM does not require a burn-in period prior to adaptation. RAM allows the user the option of using the traditional normally-distributed proposals, or t-distributed proposals for heavier-tailed target densities. Unlike AM and AMM, RAM can cope with targets having arbitrarily heavy tails, and handles multimodal targets better than AM. The user is still assumed to know and specify the target acceptance rate.

RAM has five algorithm specifications: alpha.star is the target acceptance rate, B optionally accepts a list of blocked parameters and defaults to NULL, Dist is the proposal distribution as either “N” for normal or “t” for the Student t with 5 degrees of freedom, gamma accepts a scalar in the interval (0.5,1] and controls the decay of adaptation (0.66 is recommended), and n is the number of previous iterations. RAM adapts only when the variance-covariance matrix is positive-definite.
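
A minimal sketch of the RAM update of Vihola (2011) with normal proposals, assuming a generic log-posterior `logpost`; a full Cholesky refactorization stands in for the rank-one Cholesky update used in practice for efficiency:

```r
logpost <- function(x) sum(dnorm(x, 0, 1, log = TRUE))  # toy target

ram_step <- function(x, S, n, alpha.star, gamma, logpost) {
  u <- rnorm(length(x))                               # N(0, I) increment
  prop <- x + drop(S %*% u)                           # proposal shaped by S
  alpha <- min(1, exp(logpost(prop) - logpost(x)))    # acceptance probability
  if (runif(1) < alpha) x <- prop
  eta <- min(1, length(x) * n^(-gamma))               # decaying adaptation weight
  M <- S %*% (diag(length(x)) + eta * (alpha - alpha.star) *
              tcrossprod(u) / sum(u^2)) %*% t(S)      # shrink or expand along u
  list(x = x, S = t(chol(M)))                         # new lower-triangular factor
}

x <- c(3, -3); S <- diag(2)
for (n in 1:5000) {
  st <- ram_step(x, S, n, alpha.star = 0.234, gamma = 0.66, logpost)
  x <- st$x; S <- st$S
}
```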

The advantages of RAM over AMM are that RAM does not require a burn-in period before it can begin to adapt, RAM better handles multimodal or heavy-tailed targets, and RAM adapts to the shape of the target distributions while attempting to coerce the acceptance rate. The advantages of RAM over Adaptive Metropolis-within-Gibbs (AMWG) are that RAM takes correlations into account, and is much faster to update each iteration. The disadvantage of RAM compared to AMWG is that more information must be learned in the covariance matrix to adapt properly, and frequent adaptation may be desirable, but slow. If RAM is used for adaptation, then the final, non-adaptive algorithm should be Random-Walk Metropolis (RWM).

References

  • Vihola M (2011). “Robust Adaptive Metropolis Algorithm with Coerced Acceptance Rate.” Statistics and Computing, forthcoming.

See Also

Random Dive Metropolis-Hastings

(RDMH)


Aspect Description
Acceptance Rate The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications This is a widely applicable, general-purpose algorithm that is especially useful with multimodal or fat-tailed target distributions.
Difficulty This algorithm is easy for a beginner because it is fully automatic; tuning is not required.
Final Algorithm? Yes.
Proposal Componentwise.


Random Dive Metropolis-Hastings (RDMH) was introduced by Dutta (2012). RDMH is a componentwise Metropolis-Hastings algorithm in which the proposal is the product or ratio of a current parameter and a random quantity that does not require any tuning parameters. The random quantity is in the interval (-1,1). Although RDMH is a very simple algorithm, it has excellent convergence and mixing properties. However, if it is reasonable that the origin (0) has positive probability, then the model should be reparameterized, because RDMH fails in the obscure case when the origin has positive probability (which can arise if the state-space is the set of integers).

RDMH does not have any algorithm specifications.

RDMH allows the proposed state to be far away from the current state, and yet has a good acceptance rate. Therefore, RDMH can explore the state space faster than many other MCMC algorithms.

RDMH is geometrically ergodic for a class of densities that is much larger than most other MCMC algorithms. RDMH excels at representing target densities that are multimodal or fat-tailed, and has been compared with Gibbs sampling (Gibbs), Metropolis-Adjusted Langevin Algorithm (MALA), and Metropolis-within-Gibbs (MWG). Each of these other algorithms became stuck at local modes in multimodal target densities with large distances between the modes. RDMH explored the multimodal densities, or the fat tails, with ease.

As a componentwise algorithm, the model specification function is evaluated a number of times equal to the number of parameters, per iteration. The order of the parameters for updating is randomized each iteration (random-scan), as opposed to sequential updating (deterministic-scan).
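
As a rough illustration of this componentwise, random-scan update with a multiplicative proposal (a sketch, not the package's internal code), one sweep might look like the following, where log.post is a placeholder for the unnormalized log-posterior; the acceptance ratio of a multiplicative proposal carries a Jacobian term in the random multiplier (see Dutta, 2012, for the exact form).

```r
# A sketch of one componentwise, random-scan RDMH sweep.
# log.post is a placeholder for the unnormalized log-posterior.
rdmh.sweep <- function(theta, log.post) {
  for (j in sample(length(theta))) {      # randomized update order
    eps <- runif(1, -1, 1)                # random quantity in (-1,1)
    prop <- theta
    if (runif(1) < 0.5) {
      prop[j] <- theta[j] * eps           # "dive" toward the origin...
      log.jac <- log(abs(eps))
    } else {
      prop[j] <- theta[j] / eps           # ...or far away from it
      log.jac <- -log(abs(eps))
    }
    # Metropolis-Hastings acceptance with the multiplicative Jacobian:
    if (log(runif(1)) < log.post(prop) - log.post(theta) + log.jac)
      theta <- prop
  }
  theta
}
```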

The advantages of RDMH over MWG are that RDMH does not require tuning, better explores multimodal and fat-tailed target distributions, is better able to find acceptable proposals that are distant, and that it may therefore explore the state-space faster. Compared to multivariate MCMC algorithms, RDMH shares common disadvantages that it is slower per iteration, and correlated parameters are not taken into account. Since RDMH is not adaptive, it is suitable as a final algorithm.

References

  • Dutta S (2012). “Multiplicative Random Walk Metropolis-Hastings on the Real Line.” Sankhya B, 74(2), 315-342.

See Also

Refractive Sampler

(Refractive)


Acceptance Rate: The optimal acceptance rate is between 0.6 and 0.7.
Applications: This is a widely applicable, general-purpose algorithm.
Difficulty: This algorithm is moderately difficult because it requires tuning of the number of steps, as well as the step size. However, the step size may be tuned automatically with the adaptive version, leaving only the number of steps to be tuned by the user.
Final Algorithm? Yes, if non-adaptive; no, if adaptive.
Proposal: Multivariate.


Refractive sampling was introduced by Boyles and Welling (2012) as an alternative to Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS). While the refractive sampler, HMC, and NUTS all use partial derivatives, the refractive sampler uses partial derivatives only for direction, not magnitude. The partial derivatives always point toward the side with the higher index of refraction. Like HMC and unlike NUTS, the refractive sampler requires tuning.

The Refractive algorithm has four algorithm specifications: Adaptive, the number of steps m to take per iteration, the step size w, and r, the ratio between indices of refraction. The Adaptive argument does not appear in Boyles and Welling (2012). When Adaptive is less than the number of iterations, an optional Robbins-Monro stochastic approximation of Garthwaite et al. (2010) is applied to step size w. All specifications must be scalar, and it is recommended that r=1.3. The number of steps m is similar to the number of leapfrog steps L in HMC, and step size w is similar to epsilon. Typically, as the number of parameters increases, it is often better for m to be larger and w smaller.
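
A hedged sketch of a call follows; Model, MyData, and Initial.Values are placeholders, the values of m and w are illustrative starting points rather than recommendations, and the interpretation of Adaptive as a threshold on the iteration count is assumed from the description above.

```r
# A sketch of invoking the Refractive sampler (placeholder Model, MyData,
# and Initial.Values); m and w are illustrative and require tuning.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
                     Iterations=20000, Algorithm="Refractive",
                     Specs=list(Adaptive=20001, # > Iterations: non-adaptive run
                                m=10,           # steps per iteration
                                w=0.1,          # step size
                                r=1.3))         # ratio of indices of refraction
```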

Compared to other MCMC algorithms, Refractive often finds a higher posterior density. An advantage over NUTS is that iterative speed is consistent. An advantage over HMC is that Refractive is easier to tune. When Adaptive is greater than the number of iterations, the Refractive sampler is not adaptive and is suitable as a final algorithm.

References

  • Boyles LB and Welling M (2012). “Refractive Sampling.” http://www.ics.uci.edu/~lboyles/publications.html
  • Garthwaite P, Fan Y, Sisson S (2010). “Adaptive Optimal Scaling of Metropolis-Hastings Algorithms Using the Robbins-Monro Process.”

See Also

Reversible-Jump

(RJ)


Acceptance Rate: The optimal acceptance rate is unknown (not for reversible-jump in general, but for the CHARM algorithm within this version of reversible-jump). As such, the observed acceptance rate may be suitable in the interval [10%,90%].
Applications: This version of reversible-jump is intended for variable selection and Bayesian Model Averaging (BMA).
Difficulty: This algorithm is difficult for a beginner. There are several algorithm specifications. Convergence will be difficult to assess.
Final Algorithm? Yes.
Proposal: Componentwise.


Reversible-jump Markov chain Monte Carlo (RJMCMC) was introduced by Green (1995) as an extension to MCMC in which the dimension of the model is uncertain and to be estimated. Reversible-jump belongs to a family of trans-dimensional algorithms, and it has many applications, including variable selection, model selection, mixture component selection, and more. The RJ algorithm, here, is one of the simplest possible implementations and is intended for variable selection and Bayesian Model Averaging (BMA).

The Componentwise Hit-And-Run Metropolis (CHARM) algorithm was selected in LaplacesDemon to be extended with reversible-jump. CHARM was selected because it does not have tuning parameters, it is not adaptive (which simplifies things with RJ), and it performs well. Even though it is a componentwise algorithm, it does not evaluate all potential predictors each iteration, but only those that are included. This novel combination is Reversible-Jump (RJ) with Componentwise Hit-And-Run Metropolis (CHARM). The optimal acceptance rate, and a good suggested acceptance rate range, are unknown.

RJ proceeds by making two proposals during each iteration. First, a within-model move is proposed. This means that the model dimension does not change, and the algorithm proceeds like a traditional CHARM algorithm. Next, a between-models move is proposed, where a selectable parameter is sampled, and its status in the model is changed. If this selected parameter is in the current model, then RJ proposes a model that excludes it. If this selected parameter is not in the current model, then RJ proposes a model that includes it. RJ also includes a user-specified binomial prior distribution for the expected size of the model (the number of included, selectable parameters), as well as user-specified prior probabilities for the inclusion of each of the selectable parameters.

Behind the scenes, RJ keeps track of the most recent non-zero value for each selectable parameter. If a parameter is currently excluded, then its value is currently set to zero. When this parameter is proposed to be included, the most recent non-zero value is used as the basis of the proposal, rather than zero. In this way, an excluded parameter does not have to travel back toward its previously included density, which may be far from zero. However, if RJ begins updating after a previous run had ended, then it will not begin again with this most recent non-zero value. Please keep this in mind with this implementation of RJ.

RJ has five algorithm specifications. bin.n is the scalar size parameter of the binomial prior distribution for model size, and is the maximum size that RJ will explore. bin.p is the scalar probability parameter of the binomial prior distribution for model size, and the mean or median expected model size is bin.n x bin.p. parm.p is a vector of parameter-inclusion probabilities that must be equal in length to the number of initial values. selectable is a vector of indicators (0 or 1) that indicate whether or not a parameter is selectable by reversible-jump; when an element is zero, the parameter is always in the model. Finally, the selected vector contains indicators of whether or not each parameter is in the model when RJ begins to update. All of these specifications, with the exception of selected, must receive an argument with exactly that name (such as bin.n=bin.n) for the Consort function to recognize it.
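
The following sketch shows how the five specifications might fit together for a regression-style model in which the first parameter (say, an intercept) is always included; Model, MyData, and Initial.Values are placeholders, and the values are illustrative only.

```r
# A sketch of reversible-jump variable selection (placeholder Model,
# MyData, Initial.Values); values are illustrative.
K <- length(Initial.Values)                  # total number of parameters
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
                     Iterations=50000, Algorithm="RJ",
                     Specs=list(bin.n=K-1,          # maximum model size
                                bin.p=0.5,          # expected size = bin.n * bin.p
                                parm.p=rep(0.5, K), # inclusion probabilities
                                selectable=c(0, rep(1, K-1)), # intercept always in
                                selected=c(1, rep(1, K-1))))  # start fully included
```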

RJ provides a sampler-based alternative to variable selection, compared with the Bayesian Adaptive Lasso (BAL) or Stochastic Search Variable Selection (SSVS), as two of many other approaches. Examples of BAL and SSVS are in the Examples vignette. Advantages of RJ are that it is easier for the user to apply to a given model than writing the variable-selection code into the model, and RJ requires fewer parameters, because variable-inclusion is handled by the specialized sampler, rather than the model specification function. A disadvantage of RJ is that other methods allow the user to freely change to other MCMC algorithms, if the current algorithm is unsatisfactory.

References

  • Green P (1995). “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination.” Biometrika, 82, 711-732.

See Also

Reflective Slice Sampler

(RSS)


Acceptance Rate: The acceptance rate is 1.
Applications: This is a widely applicable, general-purpose algorithm.
Difficulty: This algorithm is difficult to tune. There are two tuning parameters: the number of steps to take, and the step size.
Final Algorithm? Yes.
Proposal: Multivariate.


Introduced in Neal (1997), the Reflective Slice Sampler (RSS) is a multivariate slice sampler that uses partial derivatives and reflects into a slice interval. As described in Neal (2003), each iteration, direction and momentum within the slice interval are determined randomly, partial derivatives are taken once, and a number of steps is taken according to a step size. The distance of each step is the product of the step size and momentum. Direction is changed by reflecting off the boundaries of the slice interval. The reflected momentum direction is a function of the incident momentum and the partial derivatives. Random-walk behavior is suppressed, because a large number of steps may be taken in the same direction. The acceptance rate is 100%.

RSS has two algorithm specifications: m and w. Each iteration, m steps are taken with step size w. While m must be a scalar, w may be a scalar or a vector equal in length to the number of parameters. A step size that is too small is inefficient, but too small is better than too large, since an overly large step size can miss the target distribution entirely.
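
As a sketch, with placeholder Model, MyData, and Initial.Values, and illustrative values that will require tuning:

```r
# A sketch of the two RSS specifications; start small with w, per the
# advice above that too small is safer than too large.
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
                     Iterations=20000, Algorithm="RSS",
                     Specs=list(m=10,     # steps per iteration
                                w=0.01))  # step size (scalar or vector)
```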

RSS is difficult to tune, but is consistent in time per iteration. Neal (2003) states that Hamiltonian Monte Carlo (HMC) performs better than RSS when both HMC and RSS are optimally tuned. Since RSS is not adaptive, it may be used as a final algorithm.

References

  • Neal R (1997). “Markov Chain Monte Carlo Methods Based on Slicing the Density Function.” Technical Report, University of Toronto.
  • Neal R (2003). “Slice Sampling (with Discussion).” Annals of Statistics, 31(3), 705-767.

See Also

Random-Walk Metropolis

(RWM)


Acceptance Rate: The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications: This is a widely applicable, general-purpose algorithm that is best suited to models with a small to medium number of parameters. The proposal covariance matrix must be solved, and this matrix grows with the number of parameters.
Difficulty: This algorithm is moderately difficult for a beginner because the proposal covariance matrix must be tuned.
Final Algorithm? Yes.
Proposal: Blockwise or Multivariate.


Random-Walk Metropolis (RWM) is a multivariate extension of Metropolis-within-Gibbs (MWG). Multiple parameters usually exist, and therefore correlations may occur between the parameters. Many MCMC algorithms attempt to estimate multivariate proposals, usually taking correlations into account through a covariance matrix. Each iteration, a multivariate proposal is made by taking a draw from a multivariate normal distribution, given a proposal covariance matrix.

RWM does not have any required algorithm specifications, though blockwise sampling may be specified with B, which accepts a list of proposal covariance matrices equal in length to the number of blocks. By default, blockwise sampling is not performed, so all parameters are updated with one multivariate proposal.

Given the number of dimensions (K) or parameters, the optimal scale of the proposal covariance, also called the jumping kernel, has been reported as \(2.4^2/K\) based on the asymptotic limit of infinite-dimensional Gaussian target distributions that are independent and identically-distributed (Gelman, Roberts, and Gilks 1996). In applied settings, each problem is different, so the amount of correlation varies between variables, target distributions may be non-Gaussian, the target distributions may be non-IID, and the scale should be optimized. There are algorithms in statistical literature that attempt to optimize this scale, such as the Robust Adaptive Metropolis (RAM) algorithm.

There have been numerous methods introduced for tuning the proposal covariance matrix. Many adaptive MCMC algorithms such as Adaptive Metropolis (AM), Adaptive-Mixture Metropolis (AMM), and RAM will tune the proposal covariance matrix. Alternatively, initially specify an identity matrix with small-scale diagonal values such as 1e-3 as the proposal covariance matrix, update with RWM, and then update again but this time with the observed covariance of the historical samples. Done repeatedly, this may arrive at an acceptable proposal covariance matrix, suitable for longer and more serious updates.
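
A minimal sketch of that iterative strategy follows, assuming the usual LaplacesDemon calling convention and that the thinned samples are available in the output (here taken from Fit$Posterior1); Model, MyData, and Initial.Values are placeholders.

```r
# A sketch of iteratively tuning the RWM proposal covariance matrix.
K <- length(Initial.Values)
Covar <- diag(1e-3, K)                  # small-scale identity to begin
for (run in 1:5) {                      # several short, exploratory runs
  Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
                       Covar=Covar, Iterations=5000,
                       Algorithm="RWM")
  # Re-scale the observed covariance of the historical samples with the
  # asymptotic 2.4^2/K of Gelman, Roberts, and Gilks (1996):
  Covar <- (2.4^2 / K) * cov(Fit$Posterior1)
  Initial.Values <- as.vector(Fit$Posterior1[nrow(Fit$Posterior1), ])
}
```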

Since RWM is not adaptive, it is suitable as a final algorithm.

References

  • Gelman A, Roberts G, Gilks W (1996). “Efficient Metropolis Jumping Rules.” Bayesian Statistics, 5, 599-608.
  • Haario H, Saksman E, Tamminen J (2001). “An Adaptive Metropolis Algorithm.” Bernoulli, 7(2), 223-242.
  • Roberts G, Rosenthal J (2009). “Examples of Adaptive MCMC.” Journal of Computational and Graphical Statistics, 18, 349-367.
  • Vihola M (2011). “Robust Adaptive Metropolis Algorithm with Coerced Acceptance Rate.” Statistics and Computing (forthcoming). Springer, Netherlands.

See Also

Sequential Adaptive Metropolis-within-Gibbs

(SAMWG)


Acceptance Rate: The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications: This algorithm is applicable with state-space models (SSMs), including dynamic linear models (DLMs).
Difficulty: This algorithm is relatively easy for a beginner. It has few algorithm specifications. However, since it is adaptive, the user must monitor diminishing adaptation.
Final Algorithm? User discretion. The SMWG algorithm is recommended as the final algorithm, though SAMWG may be used if diminishing adaptation occurs and adaptation effectively ceases.
Proposal: Componentwise.


The Sequential Adaptive Metropolis-within-Gibbs (SAMWG) algorithm is for state-space models (SSMs), including dynamic linear models (DLMs). It is identical to the Adaptive Metropolis-within-Gibbs (AMWG) algorithm, except with regard to the order of updating parameters (and here, sequential does not refer to deterministic-scan). Parameters are grouped into two blocks: static and dynamic. At each iteration, static parameters are updated first, followed by dynamic parameters, which are updated sequentially through the time-periods of the model. The order of the static parameters is randomly selected at each iteration, and if there are multiple dynamic parameters for each time-period, then the order of the dynamic parameters is also randomly selected. The SAMWG algorithm is adapted from Geweke and Tanizaki (2001) for LaplacesDemon. The SAMWG is a single-site update algorithm that is more efficient in terms of iterations, though convergence can be slow with high intercorrelations in the state vector (Fearnhead 2011). If SAMWG is used for adaptation, then the final, non-adaptive algorithm should be Sequential Metropolis-within-Gibbs (SMWG).

References

  • Fearnhead P (2011). “MCMC for State-Space Models.” In S Brooks, A Gelman, G Jones, XL Meng (eds.), Handbook of Markov Chain Monte Carlo, pp. 513-530. Chapman & Hall, Boca Raton, FL.
  • Geweke J, Tanizaki H (2001). “Bayesian Estimation of State-Space Models Using the Metropolis-Hastings Algorithm within Gibbs Sampling.” Computational Statistics and Data Analysis, 37, 151-170.

See Also

Stochastic Gradient Langevin Dynamics

(SGLD)


Acceptance Rate: The acceptance rate is 100%.
Applications: This is a widely applicable, general-purpose algorithm that is specifically designed for big data.
Difficulty: This algorithm is easy for a beginner, though the step size must be specified and tuned.
Final Algorithm? Yes.
Proposal: Multivariate.


The Stochastic Gradient Langevin Dynamics (SGLD) algorithm of Welling and Teh (2011) is the first MCMC algorithm designed specifically for big data. Traditional MCMC algorithms require the entire data set to be included in the model evaluation each iteration. In contrast, SGLD reads and processes only a small, randomly selected batch of records each iteration. In addition to saving computation time, the entire data set does not need to be loaded into memory at once.

SGLD expands the stochastic gradient descent optimization algorithm to include Gaussian noise with Langevin dynamics. The multivariate proposal is merely the vector of current values plus half the step size times the stochastic gradient, plus Gaussian noise. The scale of the Gaussian noise is determined by the step size, ε.
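
Written out, the update of Welling and Teh (2011, equation 4), with full data size \(N\) (here, Nr) and batch size \(n\) (here, size), may be expressed as

\[
\Delta\theta_t = \frac{\varepsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_{ti} \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \varepsilon_t),
\]

where the first term in the parentheses is the gradient of the log-prior and the sum is the batch estimate of the log-likelihood gradient.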

SGLD has five algorithm specifications: epsilon or ε is the step size, file accepts the quoted name of a .csv file that is the big data set, Nr is the number of rows in the big data set, Nc is the number of columns in the big data set, and size is the number of rows to be read and processed each iteration.

Since SGLD is designed for big data, the entire data set is not included in the Data list; instead, one small batch must be included and named \(X\). All variables must be included in this batch: for example, both the dependent variable \(y\) and the design matrix \(X\) in linear regression. The requirement for the small batch to be in Data is so that numerous checks may be passed after LaplacesDemon is called and before the SGLD algorithm begins. Each iteration, SGLD uses the scan function, without headers, to read a random block of rows from, say, X.csv, stores it in Data$X, and passes it to the Model specification function. The Model function must therefore differ from the other examples found in the LaplacesDemon package in that multiple objects, such as \(X\) and \(y\), must be read from Data$X, whereas usually there are separate Data$X and Data$y.

The user tunes SGLD with step size ε via the epsilon argument, which accepts either a scalar for a constant step size or a vector equal in length to the number of iterations for an annealing schedule. The step size must remain in the interval (0,1). The user may define an annealing schedule with any function desired. Examples are given in Welling and Teh (2011) of decreasing schedules, as well as for using a constant. When ε = 0, both the stochastic gradient and Langevin dynamics components of the equation are reduced to zero and the algorithm will not move. When ε is too large, degenerate results occur. A good recommendation seems to be to begin with ε set to 1/Nr. From testing on numerous examples, it seems preferable to perform several short runs and experiment with a constant, rather than using a decreasing schedule, but your mileage may vary.
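
For instance, the two approaches to specifying ε might look like the following sketch, where Nr and the schedule constants are illustrative; Welling and Teh (2011) use decreasing schedules of the polynomial form shown.

```r
# Two ways to specify the SGLD step size (illustrative values).
Nr <- 1e6                # rows in the big data set
Iterations <- 20000

# Constant step size, following the 1/Nr suggestion above:
epsilon <- 1 / Nr

# Or a decreasing schedule of the form a*(b + t)^(-gamma),
# as in Welling and Teh (2011):
a <- 1e-2; b <- 100; gamma <- 0.55
epsilon <- a * (b + 1:Iterations)^(-gamma)
```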

The SGLD algorithm presented here is the simplest one, which is equation 4 in Welling and Teh (2011). Other components were also proposed such as a preconditioning matrix and covariance matrix, though these are not currently included here.

SGLD does not include a Metropolis step for acceptance and rejection, so the acceptance rate is 100%. SGLD is slower than a componentwise algorithm for two reasons: first it must read data from an external file each iteration, and second, gradients with numerical differencing are computationally expensive, requiring many model evaluations per iteration. At least Nr / size iterations are suggested. SGLD has great promise for the application of MCMC to massive data sets.

References

  • Welling M, Teh Y (2011). “Bayesian Learning via Stochastic Gradient Langevin Dynamics.” Proceedings of the 28th International Conference on Machine Learning (ICML), 681-688.

See Also

Slice Sampler

(Slice)


Acceptance Rate: The acceptance rate is 1.
Applications: This is a widely applicable, general-purpose algorithm.
Difficulty: This algorithm is easy for a beginner. There are several algorithm specifications, and the defaults are recommended for automatic use with continuous parameters.
Final Algorithm? Yes.
Proposal: Componentwise.


The origin of slice sampling dates back to Besag and Green (1993), and the current algorithm was introduced in Neal (1997) and improved in Neal (2003). In slice sampling, a distribution is sampled by sampling uniformly from the region under the plot of its density function. Here, slice sampling uses two phases as follows. First, an interval is created for the slice, and second, rejection sampling is performed within this interval.

Slice has five algorithm specifications: B defaults to NULL in which case blockwise updating is not performed, or B accepts a list in which each component contains a vector that indicates the position number of parameters to be sampled per block. Within each block, the order of evaluations is randomized each iteration. The Bounds specification accepts either a vector of length two for the lower and upper bound, or a list of vectors, one for each block. The m specification is the maximum number of steps (an integer in [1, ∞], and defaults to Inf), and may be a scalar or a list with a scalar per block. The Type specification accepts the following strings: “Continuous”, “Nominal”, or “Ordinal”. This string may be a scalar, or a list of scalars, one for each block. This specification refers to the type of parameter, which may be continuous or discrete. Discrete slice sampling is performed either for nominal or ordinal parameters. Finally, the w specification accepts a scalar step size (in (0, ∞), and defaults to 1), or accepts a list of scalars, one for each block. Ideally, w is 3 standard deviations of the target distribution for continuous parameters, and is usually 1 for discrete parameters. All parameters in a block receive the same specifications, but specifications may differ by block.
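
For the recommended automatic use with continuous parameters, a call might look like the following sketch; Model, MyData, and Initial.Values are placeholders, and the Specs values are the defaults described above.

```r
# A sketch of Slice with the default specifications for continuous
# parameters (placeholder Model, MyData, Initial.Values).
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
                     Iterations=10000, Algorithm="Slice",
                     Specs=list(B=NULL,              # no blockwise updating
                                Bounds=c(-Inf, Inf), # unbounded interval
                                m=Inf,               # unlimited stepping out
                                Type="Continuous",
                                w=1))                # step size
```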

The Slice algorithm implemented here is componentwise, and the algorithm for continuous parameters is based on figures 3 and 5 in Neal (2003), in which the slice is replaced with an interval \(I\) that contains most of the slice. For each parameter, an interval \(I\) is created and expanded by the “stepping out” procedure with step size w until the interval is larger than the slice, and the number of steps m may be limited by the user. The original slice sampler is inappropriate for the unbounded interval (-∞, ∞), and this improved version allows this unbounded interval by replacing the slice with the created interval \(I\). This algorithm adaptively changes the scale each iteration, though it is not an adaptive algorithm in the usual sense, because it retains the Markov property.

Blockwise sampling of a componentwise algorithm allows different specifications per block, and allows the user to control the order of evaluations, because blocks are processed in order, though parameters per block are selected in a randomized order.

The lower and upper bounds of interval \(I\) default to (-∞, ∞), and may be constrained for constrained parameters. Once the interval is constructed, a proposal is drawn randomly from within the interval until the proposal is accepted, and the interval is otherwise shrunk. The acceptance rate of this Slice algorithm is 1, though multiple model evaluations occur per iteration.

When this Slice sampler uses the default specifications and begins far from the target distributions, the time per iteration should decrease as the algorithm approaches the target distributions. Considerable time may be spent in the first iterations. One strategy may be to limit m.

This componentwise Slice algorithm has been noted to be more efficient than Metropolis updates, may be easier to implement than Gibbs sampling (Gibbs), and is attractive for routine and automated use (Neal, 2003). An adaptive version is AFSS. As a componentwise algorithm, Slice is slower per iteration than algorithms that use multivariate proposals. Multivariate slice samplers have been proposed, such as ESS, OHSS, and UESS. Since Slice is not an adaptive algorithm, it is acceptable as a final algorithm.

References

  • Besag J and Green PJ (1993). “Spatial Statistics and Bayesian Computation.” Journal of the Royal Statistical Society, Series B, 55, 25-37.
  • Neal R (1997). “Markov Chain Monte Carlo Methods Based on Slicing the Density Function.” Technical Report, University of Toronto.
  • Neal R (2003). “Slice Sampling (with Discussion).” Annals of Statistics, 31(3), 705-767.

See Also

Sequential Metropolis-within-Gibbs

(SMWG)


Acceptance Rate: The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications: This algorithm is applicable with state-space models (SSMs), including dynamic linear models (DLMs).
Difficulty: This algorithm is relatively easy for a beginner when the proposal variance has been tuned with the SAMWG algorithm. Otherwise, it may be tedious for the user to tune the proposal variance.
Final Algorithm? Yes.
Proposal: Componentwise.


The Sequential Metropolis-within-Gibbs (SMWG) algorithm is the non-adaptive version of the Sequential Adaptive Metropolis-within-Gibbs (SAMWG) algorithm, and is used for final sampling of state-space models (SSMs).

References

  • none

See Also

Tempered Hamiltonian Monte Carlo

(THMC)


Acceptance Rate: The optimal acceptance rate is 65% when L > 1, or 57.4% when L = 1. The observed acceptance rate may be suitable in the interval [60%,70%] when L > 1, or [40%,80%] when L = 1.
Applications: This is a widely applicable, general-purpose algorithm that is best suited to models with a small number of parameters. The number of model evaluations per iteration increases with the number of parameters.
Difficulty: This algorithm is difficult for a beginner. It has several algorithm specifications, and these are difficult to tune.
Final Algorithm? Yes.
Proposal: Multivariate. Proposals are multivariate only in the sense that proposals for multiple parameters are generated at once. Each iteration involves numerous proposals, due to partial derivatives and L.


The Tempered Hamiltonian Monte Carlo (THMC) algorithm is an extension of Hamiltonian Monte Carlo (HMC) to include another algorithm specification: Temperature, which must be positive. When greater than 1, the algorithm should explore more diffuse distributions, and may be helpful with multimodal distributions.

THMC has four algorithm specifications: step size epsilon, the number L of leapfrog steps, an optional mass matrix M, and Temperature. The specifications are the same as for HMC, with the exception of Temperature, which is described below.

There are a variety of ways to include tempering in HMC, and this algorithm, named here as THMC, uses “tempered trajectory”, as described by Neal (2011). When L > 1 and during the first half of the leapfrog steps, the momentum is increased (heated) by multiplying it by √T, where T is temperature, each leapfrog step. In the last half of the leapfrog steps, the momentum decreases (is cooled down) by dividing it by √T. The momentum is largest in the middle of the leapfrog steps, where mode-switching behavior becomes most likely to occur. This preserves the trajectory, εL.
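
A rough sketch of such a tempered trajectory follows; grad is a placeholder for the gradient of the log-posterior, and the exact placement of the multiplications and divisions (especially for odd L) follows Neal (2011), so this is an illustration rather than the package's internal code.

```r
# A sketch of a tempered leapfrog trajectory for THMC.
tempered.leapfrog <- function(theta, p, grad, epsilon, L, Temperature) {
  sqrtT <- sqrt(Temperature)
  for (l in 1:L) {
    # Heat the momentum in the first half of the trajectory,
    # cool it in the second half:
    p <- if (l <= L / 2) p * sqrtT else p / sqrtT
    p     <- p + (epsilon / 2) * grad(theta)  # half-step for momentum
    theta <- theta + epsilon * p              # full step for position
    p     <- p + (epsilon / 2) * grad(theta)  # half-step for momentum
  }
  list(theta=theta, p=p)
}
```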

As with HMC, THMC is a difficult algorithm to tune. Since THMC is non-adaptive, it is sufficient as a final algorithm.

References

  • Neal R (2011). “MCMC Using Hamiltonian Dynamics.” In S Brooks, A Gelman, G Jones, XL Meng (eds.), Handbook of Markov Chain Monte Carlo, pp. 113-162. Chapman & Hall, Boca Raton, FL.

See Also

t-walk

(twalk)


Acceptance Rate: The optimal acceptance rate is based on the multivariate normality of the marginal posterior distributions, and ranges from 44% for one parameter to 23.4% for five or more parameters. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications: This is a widely applicable, general-purpose algorithm.
Difficulty: This algorithm is easy for a beginner. It has few algorithm specifications, and these are recommended to remain at the default values. This algorithm requires an additional vector of initial values, but is otherwise fully automatic.
Final Algorithm? Yes.
Proposal: Multivariate. Proposals are multivariate only in the sense that proposals for multiple parameters are generated at once. However, proposals are not generated with a multivariate distribution and a proposal covariance matrix. Proposals are generated for only a subset of parameters at each iteration, and for only one of two sets of initial values. See below.


The t-walk (twalk) algorithm of Christen and Fox (2010) is a general-purpose MCMC algorithm that requires no tuning, is scale-invariant, is technically non-adaptive (but self-adjusting), and can sample from target distributions with arbitrary scale and correlation structures. A random subset of one of two vectors is moved around the state-space to influence one of two chains, per iteration.

twalk has four algorithm specifications: SIV is a vector of secondary initial values and defaults to NULL, n1 affects the size of the subset of each set of points to adjust and defaults to 4, at affects the traverse move and defaults to 6, and aw affects the walk move and defaults to 1.5. The vector of secondary initial values may be left at its default, NULL, in which case it is generated with the GIV function, and it is recommended that PGF is specified. SIV should be similar to, but distinct from, Initial.Values. The secondary initial values are used for a second chain, which is merely used here to help the first chain, and its results are not reported. The n1 specification relates to the number of parameters. For example, if n1=4 and a model has \(J=100\) parameters, then there is a 4/100 = 0.04 probability that a point is moved that affects each parameter, though this affects only one of two chains per iteration. Put another way, there is a 4% chance that each parameter changes each iteration, and a 50% chance each iteration that the observed chain is selected. The traverse specification argument, at, affects the traverse move, which helps when some parameters are highly correlated, and the correlation structure may change through the state-space. The traverse move is associated with an acceptance rate that decreases as the number of parameters increases, and is the reason that n1 is used to select a subset of parameters each iteration. Finally, the walk specification argument, aw, affects the walk move. The authors recommend keeping n1 in [2,20], at in [2,10], and aw in [0.3,2]. The ‘hop’ and ‘blow’ moves do not have specifications, but help with multimodality, ensure irreducibility, and prevent the two chains from collapsing together. The hop move is centered on the primary chain, and the blow move is centered on the secondary chain.
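
A sketch of a call with the default specifications follows; Model, MyData, and Initial.Values are placeholders, and SIV is constructed here by jittering the primary initial values, which is one simple way to make it similar to, but distinct from, Initial.Values.

```r
# A sketch of a t-walk run with default specifications.
SIV <- Initial.Values + runif(length(Initial.Values), -0.1, 0.1)
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
                     Iterations=100000, Algorithm="twalk",
                     Specs=list(SIV=SIV,  # secondary initial values
                                n1=4,     # size of the moved subset
                                at=6,     # traverse move
                                aw=1.5))  # walk move
```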

The authors have provided the t-walk algorithm in R code as well as other languages. It is called the “t-walk” for “traverse” or “thoughtful” walk, as opposed to Random-Walk Metropolis (RWM). Where adaptive algorithms are designed to adapt to the scale and correlation structure of target distributions, the t-walk is invariant to this structure. The step-size and direction continuously “adjust” to the local structure. The t-walk uses one of four proposal distributions or ‘moves’ per iteration, with the following probabilities: traverse (p=0.4918), walk (p=0.4918), hop (p=0.0082), and blow (p=0.0082).

Testing in LaplacesDemon with the default specifications suggests the t-walk is very promising, but due to the subset of proposals, it is important to note that the reported acceptance rate indicates the proportion of iterations in which moves were accepted, but that only a subset of parameters changed, and only one chain is selected each iteration. Therefore, a user who updates a high-dimensional model should find that parameter values change much less frequently, and this requires more iterations.

The main advantage of t-walk, like the Hit-And-Run Metropolis (HARM) and Metropolis-within-Gibbs (MWG) families, over multivariate adaptive algorithms such as Adaptive-Mixture Metropolis (AMM) and Robust Adaptive Metropolis (RAM), is that t-walk does not adapt a proposal covariance matrix, which can be limiting in terms of random-access memory and in other respects in large dimensions, making t-walk suitable for high-dimensional exploration. Other advantages are that t-walk is invariant to all but the most extreme correlation structures, and that it does not need to burn in before adapting, since it is technically non-adaptive (though it ‘adjusts’ continuously); this continuous adjustment also means that Periodicity does not need to be adjusted. The advantage of t-walk over componentwise algorithms such as the MWG family is that the model specification function does not have to be evaluated a number of times equal to the number of parameters in each iteration, allowing the t-walk algorithm to iterate significantly faster in high dimension. The disadvantage of t-walk, compared to these algorithms, is that more iterations are required, because only a subset of parameters can change at each iteration (though, on average, it still updates twice as many parameters per iteration as the MWG family).

Since twalk is technically non-adaptive, it is suitable as a final algorithm.

References

  • Christen J, Fox C (2010). “A General Purpose Sampling Algorithm for Continuous Distributions (the t-walk).” Bayesian Analysis, 5(2), 263-282.

See Also

Univariate Eigenvector Slice Sampler

(UESS)


Acceptance Rate: The acceptance rate is 1.
Applications: This is a widely applicable, general-purpose algorithm.
Difficulty: This algorithm is easy to use.
Final Algorithm? Yes, when non-adaptive.
Proposal: Multivariate.


The Univariate Eigenvector Slice Sampler (UESS) is an adaptive algorithm that approximates the sample covariance matrix in the same manner as Adaptive-Mixture Metropolis (AMM). Described as “Univar Eigen” in Thompson (2011), UESS is a multivariate variation of slice sampling (Slice) that makes univariate updates along the eigenvectors of the sample covariance matrix.

UESS has four algorithm specifications: A is the number of adaptive iterations, B accepts a list of blocks and defaults to NULL, m is the number of steps, and n is the number of previous iterations, which defaults to zero. Each iteration, a slice interval is estimated with up to m steps; the default is m=100. The step size is affected by the eigenvectors of the sample covariance matrix. The number of previous iterations, if any, is used to weight the sample covariance matrix.
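
As a sketch, with placeholder Model, MyData, and Initial.Values, and with A=0 taken to mean no adaptive iterations (and therefore a run suitable as a final algorithm):

```r
# A sketch of the UESS specifications (illustrative values).
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
                     Iterations=20000, Algorithm="UESS",
                     Specs=list(A=0,      # adaptive iterations (0 = non-adaptive)
                                B=NULL,   # no blocking
                                m=100,    # maximum steps for the slice interval
                                n=0))     # previous iterations
```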

The contours of the target distribution are approximately ellipsoidal and the eigenvectors of its covariance coincide roughly with the axes of these ellipsoids, provided that the shape of the target distribution is approximately convex. Steps taken along these eigenvectors include steps along the long axes of a slice.

Each iteration, there is a 5% probability that a non-adaptive step is taken, rather than an adaptive step. The first ten iterations or so are non-adaptive. Eigenvectors are estimated no more than every tenth iteration. When adaptive, the sample covariance matrix is updated each iteration.

Once the slice interval is estimated, a sample is drawn uniformly with rejection sampling from within this interval. If rejected, then the interval is shrunk until an acceptable sample is drawn. UESS has a 100% acceptance rate.

The time per iteration varies, since building the slice area requires up to m steps, and rejection sampling often requires more than one evaluation. Although UESS is a combination of the AMM and Slice algorithms, it usually performs as well or better than either. When UESS is adaptive, it is not suitable as a final algorithm.

References

  • Thompson M (2011). “Slice Sampling with Multivariate Steps.” PhD thesis, University of Toronto.

See Also

Updating Sequential Adaptive Metropolis-within-Gibbs

(USAMWG)


Acceptance Rate: The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications: This algorithm is applicable with state-space models (SSMs), including dynamic linear models (DLMs).
Difficulty: This algorithm is relatively easy for a beginner. It has few algorithm specifications. However, since it is adaptive, the user must monitor diminishing adaptation.
Final Algorithm? User discretion. The USMWG algorithm is recommended as the final algorithm, though USAMWG may be used if diminishing adaptation occurs and adaptation effectively ceases.
Proposal: Componentwise.


The Updating Sequential Adaptive Metropolis-within-Gibbs (USAMWG) is for state-space models (SSMs), including dynamic linear models (DLMs). After a model is fit with Sequential Adaptive Metropolis-within-Gibbs (SAMWG) and Sequential Metropolis-within-Gibbs (SMWG), and information is later obtained regarding the first future state predicted by the model, the USAMWG algorithm may be applied to update the model given the new information. In SSM terminology, updating is filtering and predicting. This is more efficient than re-estimating the entire model each time new information is obtained.

For example, suppose a time-series of daily values was fit with the SAMWG algorithm. Let’s also suppose the last day in the model sample was a Monday, and that one-step ahead forecasts are produced, so the model predicted Tuesday. After the actual value for Tuesday is known, it may be included in the model. Using USAMWG, the latest Tuesday is filtered and Wednesday is forecast, while the days of the original model sample are not estimated again.

References

  • none

See Also

Updating Sequential Metropolis-within-Gibbs

(USMWG)


Acceptance Rate: The optimal acceptance rate is 44%, and is based on the univariate normality of each marginal posterior distribution. The observed acceptance rate may be suitable in the interval [15%,50%].
Applications: This algorithm is applicable with state-space models (SSMs), including dynamic linear models (DLMs).
Difficulty: This algorithm is relatively easy for a beginner when the proposal variance has been tuned with the USAMWG algorithm. Otherwise, it may be tedious for the user to tune the proposal variance.
Final Algorithm? Yes.
Proposal: Componentwise.


The Updating Sequential Metropolis-within-Gibbs (USMWG) algorithm is the non-adaptive version of the USAMWG algorithm, and is used for final sampling when updating state-space models (SSMs).

For example, suppose a time-series of daily values was fit with the SAMWG algorithm, and SMWG may have been used for final samples. Let’s also suppose the last day in the model sample was a Monday, and that one-step ahead forecasts are produced, so the model predicted Tuesday. After the actual value for Tuesday is known, it may be included in the model. Using USAMWG, the latest Tuesday is filtered and Wednesday is forecast, while the days of the original model sample are not estimated again. USMWG may then be used for final samples.

References

  • none

See Also