Hamiltonian Monte Carlo
Stat 221, Lecture 15
@snesterko
Where we are
Reference for the lecture: Neal, R. M. (2011), "MCMC using Hamiltonian dynamics," Handbook of Markov Chain Monte Carlo.
HMC: notation
- We want to simulate \\( \theta \equiv q \\) from \\( p(\theta \given Y) \propto \pi(\theta) L(\theta \given Y)\\).
- Augment \\( \theta \\) space - each element of \\( q \\) gets a corresponding "brother" \\( p \\).
- Potential energy function (linked to the posterior) $$U(q) = -\rr{log} \left( p (q \given Y)\right)$$
- Kinetic energy function (negative log-density of momentum variables) \\( K(p) \\).
Main intuition
In the Hamiltonian system, the sum of kinetic energy \\( K \\) and
potential energy \\( U \\) is constant.
(we work on the log scale)
HMC algorithm
Two steps:
- Draw new momentum variables \\( p \\) from the distribution
(usually Gaussian) defined by the kinetic energy function.
- Propose a new state \\( (q^*, p^*) \\) by simulating Hamiltonian dynamics
starting from the current value of \\( q \\) and the fresh momentum \\( p \\).
Accept with probability
$$\rr{min}\left[ 1, \rr{exp}\left( -U(q^*) + U(q) - K(p^*) + K(p) \right)\right]$$
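A minimal sketch of one such transition in Python, assuming an identity mass matrix \\( M = I \\); the functions `U` and `grad_U` (negative log-posterior and its gradient) are supplied by the user, and the names `hmc_step`, `epsilon`, `n_steps` are illustrative, not from the lecture:

```python
import numpy as np

def hmc_step(q, U, grad_U, epsilon, n_steps, rng):
    """One HMC transition (illustrative sketch, identity mass matrix)."""
    # Step 1: draw fresh momentum p ~ N(0, I), matching K(p) = p'p / 2.
    p = rng.standard_normal(q.shape)
    current_H = U(q) + 0.5 * p @ p

    # Step 2: propose (q*, p*) by simulating the Hamiltonian dynamics
    # with n_steps leapfrog steps of size epsilon.
    q_star, p_star = q.copy(), p.copy()
    p_star = p_star - 0.5 * epsilon * grad_U(q_star)    # initial half step for p
    for i in range(n_steps):
        q_star = q_star + epsilon * p_star              # full step for q
        if i < n_steps - 1:
            p_star = p_star - epsilon * grad_U(q_star)  # full step for p
    p_star = p_star - 0.5 * epsilon * grad_U(q_star)    # final half step for p

    # Accept with probability min(1, exp(-U(q*) + U(q) - K(p*) + K(p))).
    proposed_H = U(q_star) + 0.5 * p_star @ p_star
    if rng.uniform() < np.exp(current_H - proposed_H):
        return q_star
    return q
```

For a standard Normal target, for example, `U = lambda q: 0.5 * q @ q`, `grad_U = lambda q: q`, and `rng = np.random.default_rng()`.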
The Hamiltonian
$$\begin{align} {dq_i \over dt} & = {\partial H \over \partial p_i} \\ {dp_i \over dt} & = -{\partial H \over \partial q_i}\end{align}$$
Most commonly used form of \\( H \\): $$ H(q, p) = U(q) + K(p) $$
Kinetic and potential energies
Usual choice for the kinetic energy $$K(p) = p^T M^{-1} p / 2$$
What distribution does this correspond to? (See the note below.)
Potential energy $$U(q) = -\rr{log} \left( p (q \given Y)\right)$$
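A note on the question above: the canonical density of \\( p \\) is proportional to \\( \rr{exp}(-K(p)) = \rr{exp}(-p^T M^{-1} p / 2) \\), which is proportional to the \\( N(0, M) \\) density, so the momentum draw is \\( p \sim N(0, M) \\).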
Hamiltonian narrowed down
Given the above forms of \\( H \\) and \\( K \\), the Hamiltonian equations
become the following:
$$\begin{align} {dq_i \over dt} & = \left[ M^{-1} p\right]_i \\ {dp_i \over dt} & = -{\partial U \over \partial q_i}\end{align}$$
A 1D example
$$H(q, p) = U(q) + K(p)$$
\\( U(q) = {q^2 \over 2}, K(p) = {p^2 \over 2} \\) - what are the pdfs?
Hamiltonian dynamics: $${dq \over dt} = p, \rr{ } {dp \over dt} = -q$$
Solution, for some \\(r \\) and \\(a \\):
$$q(t) = r\rr{cos}(a + t), \rr{ } p(t) = -r \rr{sin} (a+t)$$
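(Both energies correspond to standard Normal pdfs.) A quick check that the solution satisfies the equations above: $${dq \over dt} = -r\rr{sin}(a + t) = p, \rr{ } {dp \over dt} = -r\rr{cos}(a+t) = -q,$$ and \\( q^2 + p^2 = r^2 \\), so the trajectory is a circle of radius \\( r \\) on which \\( H = r^2 / 2 \\) is constant.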
Interpretation
- The Hamiltonian trajectory travels the level set of the
joint momentum-position density.
Main properties of Hamiltonian dynamics
- Reversibility.
- Conservation of the Hamiltonian.
- Volume preservation.
Reversibility
- Mapping \\( T_s \\) from the state at time \\( t \\) to the state
at time \\( t + s \\) is one-to-one and has an inverse, \\( T_{-s} \\).
- With the considered form of the Hamiltonian and \\( K(p) = K(-p) \\),
\\( T_{-s} \\) can be obtained by negating \\( p \\), applying \\( T_s \\),
and negating \\( p \\) again.
- In the 1D example, \\( T_{-s} \\) is the counter-clockwise rotation (see the check below).
- This implies that MCMC using HMC dynamics leaves desired
distribution invariant.
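A concrete check of the negate-apply-negate identity in the 1D example, where \\( T_s \\) is a clockwise rotation of the \\( (q, p) \\) plane by angle \\( s \\) and negating \\( p \\) reflects across the \\( q \\)-axis: $$T_s = \left(\begin{array}{cc} \rr{cos}(s) & \rr{sin}(s) \\ -\rr{sin}(s) & \rr{cos}(s) \end{array}\right), \rr{ } \left(\begin{array}{cc} 1 & 0 \\ 0 & -1 \end{array}\right) T_s \left(\begin{array}{cc} 1 & 0 \\ 0 & -1 \end{array}\right) = T_{-s}.$$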
Conservation of the Hamiltonian
$$\begin{align}{dH \over dt} &= \sum_{i=1}^d \left[ {d q_i \over dt} {\partial H \over \partial q_i} + {d p_i \over dt} {\partial H \over \partial p_i}\right] \\ & = \sum_{i=1}^d \left[ {\partial H \over \partial p_i} {\partial H \over \partial q_i} - {\partial H \over \partial q_i} {\partial H \over \partial p_i} \right] = 0\end{align}$$
- The trajectory travels a level set of the joint \\( (q, p) \\) distribution.
- This implies acceptance probability 1 if the Hamiltonian dynamics are simulated exactly.
Volume preservation
The vector field has zero divergence:
$$\begin{align}\sum_{i=1}^d & \left[ {\partial \over \partial q_i} {d q_i \over dt} + { \partial \over \partial p_i} {d p_i \over dt}\right] =
\\ & \sum_{i=1}^d \left[ {\partial \over \partial q_i} {\partial H \over \partial p_i} - { \partial \over \partial p_i} {\partial H \over \partial q_i}\right] = 0\end{align}$$
Don't need to use the Jacobian in the acceptance probability.
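The same holds for the leapfrog updates introduced next: each sub-update is a shear map, so its Jacobian determinant is exactly 1. For the momentum half-step, viewed as a map on \\( (q, p) \\) (a sketch): $${\partial \left(q, \rr{ } p - (\epsilon / 2) \nabla U(q)\right) \over \partial (q, p)} = \left(\begin{array}{cc} I & 0 \\ -(\epsilon / 2) \nabla^2 U(q) & I \end{array}\right), \rr{ } \rr{det} = 1.$$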
The leapfrog method
$$\begin{align}p_i(t + \epsilon / 2) & = p_i(t) - {\epsilon \over 2} {\partial U \over \partial q_i}\left(q(t)\right) \\
q_i(t + \epsilon) & = q_i(t) + \epsilon {p_i(t + \epsilon / 2) \over m_i} \\
p_i(t + \epsilon) &= p_i(t + \epsilon / 2) - {\epsilon \over 2} {\partial U \over \partial q_i}\left(q(t + \epsilon)\right)\end{align}$$
Here \\( K(p) = \sum_{i=1}^d { p_i^2 \over 2m_i } \\).
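A small numeric check of the discretization on the 1D example from earlier ( \\( U = q^2/2 \\), \\( K = p^2/2 \\), \\( m = 1 \\) ); the step size and number of steps are illustrative choices:

```python
def leapfrog_step(q, p, eps):
    """One leapfrog step for U(q) = q^2 / 2, where dU/dq = q."""
    p = p - 0.5 * eps * q   # half step for momentum
    q = q + eps * p         # full step for position
    p = p - 0.5 * eps * q   # half step for momentum
    return q, p

def hamiltonian(q, p):
    return 0.5 * q ** 2 + 0.5 * p ** 2

q, p = 0.0, 1.0             # start on the unit circle, where H = 1/2
errors = []
for _ in range(200):
    q, p = leapfrog_step(q, p, eps=0.3)
    errors.append(abs(hamiltonian(q, p) - 0.5))

# The error in H oscillates at O(eps^2) but does not drift, which is
# what keeps acceptance probabilities high over long trajectories.
print(max(errors))
```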
Example: homework
$$\begin{align}
& l(B, \vec{Z}_i \given \{C_{ijk}\} ) = \rr{const} + \\
& \sum_{i=1}^I \left( \sum_{j \in J_i} \sum_{k \in K_{ij}}
\left[ \vec{x}_{jk}' B -
\rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right] \right)
\cdot \vec{Z}_i
\end{align}$$
\\( U = -l \\) (the negative log-likelihood above), \\( q = B \\).
How to implement
- Sample code in Neal's review (p. 14).
- Generic sample code for a vector case: ask the TFs.
Detailed balance in HMC
$$P(A_i) T(B_j \given A_i) = P(B_j) T(A_i \given B_j)$$
Let \\( B_k \\) be the image of \\( A_k \\) after \\( L \\) leapfrog steps
followed by negation of the momentum.
\\( P \\) is the probability under the canonical distribution, and
\\( T(X \given Y ) \\) is the conditional probability of proposing
and then accepting a move to region \\( X \\) when the current state is in
region \\( Y \\).
Example: bivariate Normal
$$H(q, p) = q^T \Sigma^{-1} q / 2 + p^T p / 2$$
with $$\Sigma = \left(\begin{array}{cc} 1 & 0.98 \\ 0.98 & 1\end{array} \right)$$
What position distribution does this correspond to?
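A note on the question: with this \\( U \\), the position distribution is \\( q \sim N(0, \Sigma) \\), since \\( \rr{exp}(-U(q)) = \rr{exp}(-q^T \Sigma^{-1} q / 2) \\) is proportional to the \\( N(0, \Sigma) \\) density. With correlation 0.98 the level sets are highly elongated ellipses, which is exactly the regime where Hamiltonian trajectories outperform random-walk proposals.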
Trajectories
Trace plots
100D Normal
Further improvements to HMC
- Use the target's Hessian to set the mass matrix.
- No-U-Turn Sampler (NUTS).
- Other.
HMC and the Hessian
- Compute the Hessian of the negative log-posterior at a local mode and use it as the mass matrix.
- Effectively, this corresponds to decorrelating the parameters.
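A sketch of why this decorrelates, using a Gaussian target for intuition (an illustrative assumption, not the general case): if \\( q \sim N(0, \Sigma) \\), the Hessian of \\( U \\) at the mode is \\( \Sigma^{-1} \\), and running HMC with \\( M = \Sigma^{-1} \\) is equivalent to running it with identity mass on the transformed variables \\( q' = L^{-1} q \sim N(0, I) \\), where \\( \Sigma = L L^T \\), whose dynamics reduce to the isotropic 1D example, dimension by dimension.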
Computational aspects of HMC
- Number of leapfrog steps and step size: \\( L \\) and \\( \epsilon \\).
- Need to compute the posterior and gradient many times.
- For good implementations of HMC, need to compute a good mass matrix.
Mass/trajectory approximation tradeoff
- If the mass matrix is poor, a finer leapfrog approximation is needed.
- Which is the better investment: a good mass matrix, or a more accurate
approximation of the Hamiltonian dynamics?
Extensions of HMC
- Bounded parameters
- Discrete parameters
Limitations
- Still prone to local modes.
- But if the modes are even somewhat connected, it's much better
than MH!
Augmentation strategies
- Hacker level
- Parameter expansion - adding fixed parameters (PX-EM, vEM)
- Data augmentation - adding random quantities (HMC, parallel/simulated tempering)
- It's all data augmentation in the Bayesian paradigm
Announcements
- T-shirt comps!
- April 4 - final project second checkpoint deadline.
- Set up time over email to check that final projects are on track.
Final slide
- Next lecture: Decision theory and estimation.