Hamiltonian Monte Carlo

Stat 221, Lecture 15

@snesterko

Where we are

Reference for the lecture

HMC: notation

  • We want to simulate \\( \theta \equiv q \\) from \\( p(\theta \given Y) \propto \pi(\theta) L(\theta \given Y) \\).
  • Augment the \\( \theta \\) space - each element of \\( q \\) gets a corresponding momentum variable \\( p \\).
  • Potential energy function (linked to the posterior) $$U(q) = -\rr{log} \left( p (q \given Y)\right)$$
  • Kinetic energy function (negative log-density of momentum variables) \\( K(p) \\).

Main intuition

In the Hamiltonian system, the sum of the kinetic energy \\( K \\) and the potential energy \\( U \\) is constant.

(we work on the log scale)

HMC: how it works

HMC algorithm

Two steps:

  1. Draw new momentum variables \\( p^* \\) from the distribution (usually Gaussian) defined by the kinetic energy function.
  2. Propose a new state \\( q^* \\) for the position variables \\( q \\) by simulating Hamiltonian dynamics given the momentum \\( p^* \\) and the current value of \\( q \\). Accept with probability

$$\rr{min}\left[ 1, \rr{exp}\left( -U(q^*) + U(q) - K(p^*) + K(p) \right)\right]$$
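As a rough illustration, below is a minimal Python (NumPy) sketch of one such iteration, assuming an identity mass matrix so that \\( K(p) = p^T p / 2 \\); `U` and `grad_U` are user-supplied functions, and `leapfrog` is the integrator sketched later in these notes:

```python
import numpy as np

def hmc_step(q, U, grad_U, eps, L, rng):
    """One HMC iteration with an identity mass matrix (a sketch, not Neal's exact code)."""
    # Step 1: draw fresh momentum p ~ N(0, I), so K(p) = p'p / 2.
    p = rng.standard_normal(q.shape)
    # Step 2: propose by approximating Hamiltonian dynamics with L leapfrog
    # steps of size eps (the `leapfrog` function is sketched later).
    q_star, p_star = leapfrog(q, p, eps, L, grad_U)
    # Accept with probability min[1, exp(-U(q*) + U(q) - K(p*) + K(p))].
    log_alpha = -U(q_star) + U(q) - 0.5 * p_star @ p_star + 0.5 * p @ p
    if np.log(rng.uniform()) < log_alpha:
        return q_star
    return q
```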

The Hamiltonian

$$\begin{align} {dq_i \over dt} & = {\partial H \over \partial p_i} \\ {dp_i \over dt} & = -{\partial H \over \partial q_i}\end{align}$$

Most commonly used form of \\( H \\): $$ H(q, p) = U(q) + K(p) $$

Kinetic and potential energies

Usual choice for the kinetic energy $$K(p) = p^T M^{-1} p / 2$$

What distribution does this correspond to?
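Up to an additive constant, this \\( K \\) is the negative log-density of a zero-mean Gaussian with covariance \\( M \\): $$e^{-K(p)} = \rr{exp}\left( -p^T M^{-1} p / 2 \right), \rr{ } p \sim \rr{N}(0, M)$$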

Potential energy $$U(q) = -\rr{log} \left( p (q \given Y)\right)$$

Hamiltonian narrowed down

Given the \\( H \\) and \\( K \\) defined above, the Hamiltonian equations become:

$$\begin{align} {dq_i \over dt} & = \left[ M^{-1} p\right]_i \\ {dp_i \over dt} & = -{\partial U \over \partial q_i}\end{align}$$

A 1D example

$$H(q, p) = U(q) + K(p)$$

\\( U(q) = {q^2 \over 2}, K(p) = {p^2 \over 2} \\) - what are the pdfs?

Hamiltonian dynamics: $${dq \over dt} = p, \rr{ } {dp \over dt} = -q$$

Solution, for some \\(r \\) and \\(a \\):

$$q(t) = r\rr{cos}(a + t), \rr{ } p(t) = -r \rr{sin} (a+t)$$
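One can verify this directly: differentiating the solution gives $${dq \over dt} = -r\rr{sin}(a + t) = p, \rr{ } {dp \over dt} = -r\rr{cos}(a+t) = -q$$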

Interpretation

  • The Hamiltonian trajectory travels along a level set of the joint momentum-position density.

Main properties of Hamiltonian dynamics

  • Reversibility.
  • Conservation of the Hamiltonian.
  • Volume preservation.

Reversibility

  • The mapping \\( T_s \\) from the state at time \\( t \\) to the state at time \\( t + s \\) is one-to-one and has an inverse, \\( T_{-s} \\).
  • With the considered form of the Hamiltonian and \\( K(p) = K(-p) \\), \\( T_{-s} \\) can be obtained by negating \\( p \\), applying \\( T_s \\), and negating \\( p \\) again.
  • In the 1D example, \\( T_{-s} \\) is the counter-clockwise rotation.
  • This implies that MCMC updates based on Hamiltonian dynamics leave the desired distribution invariant.

Conservation of the Hamiltonian

$$\begin{align}{dH \over dt} &= \sum_{i=1}^d \left[ {d q_i \over dt} {\partial H \over \partial q_i} + {d p_i \over dt} {\partial H \over \partial p_i}\right] \\ & = \sum_{i=1}^d \left[ {\partial H \over \partial p_i} {\partial H \over \partial q_i} - {\partial H \over \partial q_i} {\partial H \over \partial p_i} \right] = 0\end{align}$$

  • The trajectory travels along a level set of the joint \\( (q, p) \\) distribution.
  • This means a move is accepted with probability 1 if the Hamiltonian dynamics are simulated exactly.

Volume preservation

The vector field has zero divergence:

$$\begin{align}\sum_{i=1}^d & \left[ {\partial \over \partial q_i} {d q_i \over dt} + { \partial \over \partial p_i} {d p_i \over dt}\right] = \\ & \sum_{i=1}^d \left[ {\partial \over \partial q_i} {\partial H \over \partial p_i} - { \partial \over \partial p_i} {\partial H \over \partial q_i}\right] = 0\end{align}$$

So we don't need a Jacobian correction in the acceptance probability.

The leapfrog method

$$\begin{align}p_i(t + \epsilon / 2) & = p_i(t) - {\epsilon \over 2} {\partial U \over \partial q_i}\left( q(t) \right) \\ q_i(t + \epsilon) & = q_i(t) + \epsilon {p_i(t + \epsilon / 2) \over m_i} \\ p_i(t + \epsilon) &= p_i(t + \epsilon / 2) - {\epsilon \over 2} {\partial U \over \partial q_i}\left( q(t + \epsilon) \right)\end{align}$$

Here \\( K(p) = \sum_{i=1}^d { p_i^2 \over 2m_i } \\).
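A minimal Python sketch of these updates, assuming unit masses \\( m_i = 1 \\) for simplicity; `grad_U` returns the gradient vector \\( \partial U / \partial q \\):

```python
import numpy as np

def leapfrog(q, p, eps, L, grad_U):
    """L leapfrog steps of size eps, assuming m_i = 1; returns the proposed (q, p)."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)       # initial half step for momentum
    for _ in range(L - 1):
        q += eps * p                 # full step for position
        p -= eps * grad_U(q)         # full step for momentum
    q += eps * p                     # final full step for position
    p -= 0.5 * eps * grad_U(q)       # final half step for momentum
    # Negate the momentum so the proposal is symmetric; since K(p) = K(-p),
    # this does not change the Hamiltonian.
    return q, -p
```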

Example: homework

$$\begin{align} & l(B, \vec{Z}_i \given \{C_{ijk}\} ) = \rr{const} + \\ & \sum_{i=1}^I \left( \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right] \right) \cdot \vec{Z}_i \end{align}$$

\\( U = - \rr{log} l \\), \\( q = B \\).

How to implement

  • Sample code in Neal's review (p. 14).
  • Generic sample code for a vector case - ask the TFs.


Detailed balance in HMC

$$P(A_i) T(B_j \given A_i) = P(B_j) T(A_i \given B_j)$$

Let \\( B_k \\) be the image of \\( A_k \\) after \\( L \\) leapfrog steps followed by negating the momentum.

\\( P \\) is the probability under the canonical distribution, \\( T(X \given Y ) \\) is the conditional probability of proposing and then accepting a move to region \\( X \\) if the current state is in region \\( Y \\).

Example: bivariate Normal

$$H(q, p) = q^T \Sigma^{-1} q / 2 + p^T p / 2$$

with $$\Sigma = \left(\begin{array}{cc} 1 & 0.98 \\ 0.98 & 1\end{array} \right)$$

What position distribution does this correspond to?
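Putting the earlier sketches together for this example (here \\( U(q) = q^T \Sigma^{-1} q / 2 \\) is the negative log-density of \\( \rr{N}(0, \Sigma) \\); the step size and number of leapfrog steps below are illustrative guesses, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(221)
Sigma = np.array([[1.0, 0.98], [0.98, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

U = lambda q: 0.5 * q @ Sigma_inv @ q    # potential energy
grad_U = lambda q: Sigma_inv @ q         # its gradient

q = np.zeros(2)
draws = np.empty((1000, 2))
for i in range(1000):
    q = hmc_step(q, U, grad_U, eps=0.18, L=20, rng=rng)
    draws[i] = q                         # draws should approximate N(0, Sigma)
```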

Trajectories

Trace plots

100D Normal

Further improvements to HMC

  • Use the target's Hessian to set the mass matrix.
  • No-U-Turn Sampler (NUTS).
  • Other.

HMC and the Hessian

  • Compute the Hessian of \\( U \\) at a local mode of the posterior and use it as the mass matrix (see the sketch below).
  • Effectively, this corresponds to "uncorrelating" the parameters.
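A sketch of the idea, assuming a hypothetical user-supplied `hessian_U` function (analytic or finite-difference) that returns the Hessian of \\( U \\) at the mode, which is positive definite there:

```python
import numpy as np

def mass_from_hessian(q_mode, hessian_U):
    """Use the Hessian of U at a posterior mode as the mass matrix M."""
    M = hessian_U(q_mode)                  # positive definite at a local mode
    M_inv = np.linalg.inv(M)
    chol = np.linalg.cholesky(M)
    K = lambda p: 0.5 * p @ M_inv @ p      # kinetic energy K(p) = p' M^{-1} p / 2
    # Draw momentum p ~ N(0, M) via the Cholesky factor of M.
    draw_p = lambda rng: chol @ rng.standard_normal(len(q_mode))
    return K, draw_p
```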

Computational aspects of HMC

  • Number of leapfrog steps and step size: \\( L \\) and \\( \epsilon \\).
  • Need to compute the posterior and gradient many times.
  • For a good implementation of HMC, we also need to choose the mass matrix.

Mass/trajectory approximation tradeoff

  • If the mass matrix is poor, we need a finer leapfrog approximation.
  • Which is better: a good mass matrix or a crude approximation of the Hamiltonian dynamics?

Extensions of HMC

  • Bounded parameters
  • Discrete parameters

Limitations

  • Still prone to local modes.
  • But if the modes are even somewhat connected, it's much better than MH!

Augmentation strategies

  • Hacker level
  • Parameter expansion - adding fixed parameters (PX-EM, vEM)
  • Data augmentation - adding random quantities (HMC, parallel/simulated tempering)
  • It's all data augmentation in the Bayesian paradigm

Announcements

  • T-shirt comps!
  • April 4 - final project second checkpoint deadline.
  • Set up time over email to check that final projects are on track.

Resources

Final slide

  • Next lecture: Decision theory and estimation.