Equi-energy sampler

Stat 221, Lecture 20

@snesterko

Roadmap

  • Why we need to do even better MCMC
  • Equi-energy (EE) sampler foundations
  • Derivations
  • EE in the context of other methods
  • Is this the best we can do?

Where we are

Why we need to do better

Reference: S. C. Kou, Qing Zhou and Wing Hung Wong. Equi-Energy Sampler with Applications in Statistical Inference and Statistical Mechanics.

Two-dimensional Normal mixture:

$$\begin{align} & f(x) = \\ & \sum_{i=1}^{20} {w_i \over 2\pi \sigma_i^2} \rr{exp} \left\{ - {1 \over 2 \sigma_i^2} (x - \mu_i)'(x - \mu_i)\right\}\end{align}$$

Details

  • \\( \sigma_1 = \ldots = \sigma_{20} = 0.1 \\)
  • \\( w_1 = \ldots = w_{20} = 0.05 \\)
  • \\( \mu_1, \ldots, \mu_{20} \\) sufficiently spread out
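
To make this concrete, here is a minimal Python sketch of the target density. The mode locations `mus` below are placeholders drawn at random (the slides only say "sufficiently spread out"), not the values used in the paper.

```python
import numpy as np

# Mixture from the slides: 20 equally weighted components, common scale
# sigma = 0.1, means "sufficiently spread out". The concrete means are
# placeholders for illustration only.
rng = np.random.default_rng(0)
mus = rng.uniform(0, 10, size=(20, 2))   # assumed spread-out mode locations
sigma = 0.1
weights = np.full(20, 0.05)

def log_density(x):
    """Log of the two-dimensional Normal mixture f(x)."""
    sq = np.sum((x - mus) ** 2, axis=1)                      # (x - mu_i)'(x - mu_i)
    comps = np.log(weights) - np.log(2 * np.pi * sigma**2) - sq / (2 * sigma**2)
    return np.logaddexp.reduce(comps)                        # log of the sum over components
```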

What natural settings could this be from?

Example: how it looks

How do we detect multimodality?

  • Plotting the posterior
  • Run optim from random points
  • Intuition/similar problems

Traditional methods

  • Slow convergence:
    • Classic Metropolis or Metropolis-Hastings
    • HMC is only marginally better
  • Even when EM applies, it only finds local modes.
    • This is not too bad if we know what we are doing.

And yet, we want to do it right

  • How can we do full MCMC, converge fast, and easily run it in parallel?
    • Full MCMC - not as good as exact sampling, but gives an adequate picture of the posterior surface.
    • Converge fast - save computing resources and be able to set realistic deadlines.
    • Easily run it in parallel - stay fashionable and further increase speed.
  • Introducing the Equi-Energy Sampler.

The EE foundations

We (once again) turn to physics.

The distribution of a system in thermal equilibrium (Boltzmann distribution):

$$p_T(x) = { 1 \over Z(T)} \rr{exp} \left( -h(x) / T\right)$$

\\( h(x) \\) is the energy function (or Hamiltonian), \\( T \\) is temperature, \\( Z(T) = \sum_x \rr{exp} \left( -h(x) / T\right) \\) is the partition function.

Boltzmann law implies

Conditional distribution of the system given energy \\( h(x) = u \\) is uniform on the equi-energy surface \\( \{ x: h(x) = u \} \\):

$$\begin{align} p(x \given h(x) = u) = {p_T(x) \over \sum_{x:h(x)=u} p_T(x) }\propto \rr{const}\end{align}$$

Let the infinitesimal volume of energy slice \\( \{ x:h(x) \in (u, u + du) \} \\) be \\( \Omega(u) du \\).

\\( \Omega(u) \\) - density of states function.

Lemma

Let \\( \beta = 1/T \\).

$$\mu_g(\beta^{-1})Z(\beta^{-1}) = \int_0^\infty \nu_g(u) \Omega(u) e^{-\beta u} du$$

with thermal average \\( \mu_g(T) \\)

$$\mu_g(T) = \sum_x g(x) \rr{exp}(-h(x)/T) / Z(T)$$

and microcanonical average \\( \nu_g(u) \\): \\( \nu_g(u) = \rr{E}(g(X) \given h(X) = u) \\).
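
A sketch of why the lemma holds: split the sum over states into equi-energy slices \\( \{ x: h(x) \in (u, u + du) \} \\). Within a slice the Boltzmann weight is \\( e^{-\beta u} \\), the volume is \\( \Omega(u) du \\), and the average of \\( g \\) is \\( \nu_g(u) \\), so

$$\begin{align} \mu_g(\beta^{-1}) Z(\beta^{-1}) = \sum_x g(x) \rr{exp}(-\beta h(x)) = \int_0^\infty \nu_g(u) \Omega(u) e^{-\beta u} du\end{align}$$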

In particular

Let \\( g \equiv 1 \\); then the partition function \\( Z(\beta^{-1}) \\) and the density of states \\( \Omega(u) \\) form a Laplace transform pair.

  • The lemma suggests that we can use MCMC estimates of the density of states and the microcanonical averages to obtain Boltzmann averages and the partition function.

Main idea of EE jumps

From time to time, move within a set of (nearly) constant energy instead of making a local proposal.

EE sampler: definitions

  • The goal is to sample from \\( \pi(x) \\).
  • \\( \pi(x) \\) is presumably multimodal.
  • Let \\( h(x) \\) be the associated energy function: $$\pi(x) \propto \rr{exp}(-h(x))$$
  • Sequence of energy levels $$H_0 < H_1 < H_2 < \ldots < H_K < H_{K+1} = \infty$$
  • \\( H_0 \leq \rr{inf}_x h(x) \\)
  • Associated sequence of temperatures $$1 = T_0 < T_1 < \ldots < T_K$$

More definitions

  • There are \\( K+1 \\) distributions \\( \pi_i \\), \\( 0 \leq i \leq K \\).
  • Corresponding energy functions are $$h_i(x) = { 1 \over T_i } \left( h(x) \vee H_i\right)$$
  • That is, $$\pi_i(x) \propto \rr{exp} (-h_i(x))$$
  • This way \\( \pi_0 \\) is the needed distribution \\( \pi \\).
  • High \\( i \\) means stronger energy truncation and higher temperature, so it is easier to move between local modes.
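
Continuing the Python sketch from the mixture example, a minimal rendering of these definitions (the function names are mine; the ladders `H` and `T` are constructed later):

```python
def h(x):
    # Energy of the target: pi(x) ∝ exp(-h(x)).
    return -log_density(x)

def h_level(x, i, H, T):
    # h_i(x) = (h(x) ∨ H_i) / T_i, so pi_i(x) ∝ exp(-h_level(x, i, H, T)).
    # With H_0 ≤ inf h and T_0 = 1, i = 0 recovers the original target pi.
    return max(h(x), H[i]) / T[i]
```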

Main EE feature

  • The equi-energy jump helps sample from \\(\pi_i \\) with small \\( i\\).
  • Equi-energy sets \\( S_i(H) = \{ x:h_i(x)=H \} \\) are mutually consistent across \\( i \\) above the truncation values.
  • Construct empirical EE sets for \\( K, K-1, \ldots \\)
  • The sets remain consistent going from energy/temperature \\( i \\) to \\( i-1\\)
  • Sample points at random from EE sets - avoid local modes.

Equi-Energy rings

  • State space \\( \mathcal{X} = \cup_{j=0}^K D_j \\) with \\( D_j = \{ x : h(x) \in [H_j, H_{j+1}) \} \\), \\( 0 \leq j \leq K \\).
  • For \\( x \in \mathcal{X} \\), the partition index is \\( I(x) = j \\) if \\( x \in D_j \\), i.e. \\( h(x) \in [H_j, H_{j+1}) \\).
  • For a chain running at energy/temperature \\( i \\), the empirical versions of the EE sets \\( S_i(H) \\) are the energy rings \\( \hat{D}_j^{(i)}, 0 \leq j \leq K \\).
  • \\( \hat{D}_j^{(i)} \\) consists of the samples \\( X_n^{(i)} \\) with \\( I(X_n^{(i)}) = j \\).
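
A possible implementation of the partition index and the empirical rings, continuing the sketch (here `H` is an array holding \\( H_0, \ldots, H_K \\)):

```python
def partition_index(x, H):
    # I(x) = j when h(x) lies in [H_j, H_{j+1}), with H_{K+1} = infinity.
    return int(np.searchsorted(H, h(x), side="right")) - 1

def empty_rings(K):
    # rings[j] will collect the samples X_n^(i) whose partition index is j.
    return [[] for _ in range(K + 1)]
```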

EE sampler: the algorithm

  • Begin with an MH chain \\( X^{(K)} \\) targeting the highest-order \\( \pi_K \\).
  • After burn-in, form the \\( K \\)th-order energy rings \\( \hat{D}_j^{(K)} \\).
  • After \\( N \\) iterations, start the chain \\( X^{(K-1)} \\).
  • Update \\( X^{(K-1)} \\) with a local MH move with probability \\( 1 - p_{ee} \\), and with the EE move with probability \\( p_{ee} \\).
  • Construct and keep updating \\( \hat{D}_j^{(K-1)} \\).
  • And so on.
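
Putting the steps together, a simplified Python sketch of the sampler. For readability the chains are run one after another (hottest first) rather than interleaved as in an actual implementation, and `ee_jump` is sketched after the next slide; none of this is the paper's code.

```python
def mh_step(x, i, H, T, rng, step=0.25):
    # Local random-walk Metropolis move targeting pi_i; the step size is an
    # arbitrary placeholder and would be tuned per chain in practice.
    y = x + step * rng.standard_normal(x.shape)
    log_accept = h_level(x, i, H, T) - h_level(y, i, H, T)
    return y if np.log(rng.uniform()) < log_accept else x

def ee_sampler(x0, H, T, K, N, burn_in, p_ee, rng):
    rings = {i: empty_rings(K) for i in range(K + 1)}
    chains = {}
    for i in range(K, -1, -1):                     # highest-order chain first
        x, samples = np.asarray(x0, dtype=float), []
        for n in range(N):
            ring = rings[i + 1][partition_index(x, H)] if i < K else []
            if ring and rng.uniform() < p_ee:
                x = ee_jump(x, i, rings[i + 1], H, T, rng)   # equi-energy jump
            else:
                x = mh_step(x, i, H, T, rng)                  # local MH move
            if n >= burn_in:                                  # record and update rings
                samples.append(x.copy())
                rings[i][partition_index(x, H)].append(x.copy())
        chains[i] = samples
    return chains
```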

The EE jump

For the current state \\( X^{(i)}_n \\), the EE jump proposes a new state \\( y \\) chosen uniformly from the higher-order energy ring \\( \hat{D}_j^{(i+1)} \\) with the matching energy level, i.e. $$I(y) = I(X^{(i)}_n)$$

Accept with probability $$\rr{min} \left( 1, {\pi_{i}(y) \pi_{i+1}( X^{(i)}_n) \over \pi_{i}(X^{(i)}_n) \pi_{i+1}( y) }\right)$$
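
Continuing the sketch, the EE jump itself might look like this (an assumed implementation of the acceptance rule above, written on the log scale):

```python
def ee_jump(x, i, rings_above, H, T, rng):
    # Propose y uniformly from the (i+1)-chain's ring sharing the current
    # partition index, then accept with probability
    # min(1, pi_i(y) pi_{i+1}(x) / (pi_i(x) pi_{i+1}(y))).
    ring = rings_above[partition_index(x, H)]
    y = ring[rng.integers(len(ring))]
    log_ratio = (h_level(x, i, H, T) - h_level(y, i, H, T)) \
              + (h_level(y, i + 1, H, T) - h_level(x, i + 1, H, T))
    return y.copy() if np.log(rng.uniform()) < log_ratio else x
```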

Connection to slice sampler

  • Slice sampler has two steps:
    1. Sample auxiliary variable \\( U \sim \rr{Unif}[0, \pi(X)] \\).
    2. Sample \\( X \sim \rr{Unif} \{ x: \pi(x) \geq U\} \\).
  • In theory it looks good, but is it practical?
  • Student discussion.
  • Reference: Wikipedia article on slice sampling.
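
For comparison, a literal Python rendering of those two steps, using naive rejection from a bounding box assumed to contain the slice; this illustrates why the second step is the hard part in practice:

```python
def slice_sampler_step(x, density, lower, upper, rng, max_tries=10_000):
    # Step 1: auxiliary variable U ~ Unif[0, pi(x)].
    u = rng.uniform(0.0, density(x))
    # Step 2: X ~ Unif{x : pi(x) >= U}, here by rejection from the box
    # [lower, upper] (assumed to contain the slice).
    for _ in range(max_tries):
        y = rng.uniform(lower, upper, size=np.shape(x))
        if density(y) >= u:
            return y
    return x  # give up and keep the current state
```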

Steady-state distribution

Theorem

Under regularity conditions*, \\( X^{(i)} \\) are ergodic with \\( \pi_i \\) as steady-state distributions.

* (i) \\( X^{(K)} \\) is irreducible and aperiodic; (ii) the MH transition kernel connects adjacent energy sets; (iii) the energy-set probabilities \\( p_j^{(i)} = P_{\pi_i}(X \in D_j) > 0 \\) for all \\( i \\) and \\( j \\).

Practical implementation of EE sampler

  • Set the energy levels in a geometric progression, i.e. with \\( \rr{log}(H_{i+1} - H_i) \\) evenly spaced (a sketch follows this list).
  • Set temperatures with \\( (H_{i+1} - H_i)/T_i \approx c \\) with \\( c \in [1, 5] \\).
  • The number of energy levels/temperatures \\( K \\) can be roughly proportional to the dimensionality of \\( \pi \\).
  • Student discussion - what factors could drive the choice of \\( K \\)?
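
One illustrative way to build the ladders along these lines (the common ratio and the handling of \\( T_K \\) are my assumptions, not prescriptions from the paper):

```python
def make_ladders(H0, first_gap, K, c=2.0, ratio=1.5):
    # Energy gaps H_{i+1} - H_i form a geometric progression, so the log-gaps
    # are evenly spaced; temperatures satisfy T_0 = 1 and
    # (H_{i+1} - H_i) / T_i ≈ c for i >= 1, with c in [1, 5].
    gaps = first_gap * ratio ** np.arange(K)           # H_{i+1} - H_i, i = 0..K-1
    H = np.concatenate(([H0], H0 + np.cumsum(gaps)))   # H_0, ..., H_K
    T = np.ones(K + 1)                                  # T_0 = 1
    T[1:K] = gaps[1:] / c                               # (H_{i+1} - H_i) / T_i ≈ c
    T[K] = T[K - 1] * ratio                             # extend the pattern to T_K
    return H, T
```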

More practical settings

  • EE jump probability \\( p_{ee} \in [5\%, 30\%]\\) "often works quite well".
  • Self-adapt the MH proposal step size: tie the proposal scale to the sampler's acceptance rate (one generic scheme is sketched after this list).
  • Student discussion - can we use HMC instead of MH here?
  • Not guessing \\( H_0 = H_{\rr{min}} \\) exactly is OK; the levels can be reconfigured while the sampler runs.
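
One generic way to connect the step size to the acceptance rate (a standard diminishing-adaptation rule, offered as an assumption rather than the paper's scheme): nudge the step size toward a target acceptance rate and shrink the adjustments over time so the stationary distribution is preserved.

```python
def adapt_step(step, accepted, n, target_rate=0.25):
    # Grow the step after acceptances (above the target rate), shrink it after
    # rejections; the 1/sqrt(n) factor makes the adaptation diminish over time.
    return step * np.exp((float(accepted) - target_rate) / np.sqrt(n + 1))
```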

Back to example

Two-dimensional Normal mixture:

$$\begin{align} & f(x) = \\ & \sum_{i=1}^{20} {w_i \over 2\pi \sigma_i^2} \rr{exp} \left\{ - {1 \over 2 \sigma_i^2} (x - \mu_i)'(x - \mu_i)\right\}\end{align}$$

Use EE sampler with \\( K = 4 \\) (5 chains).
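
Tying the earlier sketches together, a hypothetical run on this mixture could look like the following (all numbers are illustrative):

```python
rng = np.random.default_rng(1)
H, T = make_ladders(H0=0.0, first_gap=2.0, K=4)        # 5 chains
chains = ee_sampler(x0=np.zeros(2), H=H, T=T, K=4,
                    N=50_000, burn_in=5_000, p_ee=0.1, rng=rng)
target_samples = np.array(chains[0])                   # approximate draws from pi
```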

EE sampler performance

Target (lowest energy chain):

Other chains

Samples in each energy ring

EE sampler vs PT sampler

Implementing EE sampler with MPI

Announcements

  • T-shirt comp ends this Wednesday!
  • Final projects.

Resources

Final slide

  • Next lecture: Approximate methods - Variational Inference