 # Equi-energy sampler

### Stat 221, Lecture 20

@snesterko

• Why we need to do even better MCMC
• Equi-energy (EE) sampler foundations
• Derivations
• EE in the context of other methods
• Is this the best we can do?

### Where we are

## Why we need to do better

Reference: S. C. Kou, Qing Zhou and Wing Hung Wong. Equi-Energy Sampler with Applications in Statistical Inference and Statistical Mechanics.

Two-dimensional Normal mixture:

\begin{align} & f(x) = \\ & \sum_{i=1}^{20} {w_i \over 2\pi \sigma_i^2} \rr{exp} \left\{ - {1 \over 2 \sigma_i^2} (x - \mu_i)'(x - \mu_i)\right\}\end{align}

## Details

• \$$\sigma_1 = \ldots = \sigma_{20} = 0.1 \$$
• \$$w_1 = \ldots = w_{20} = 0.05 \$$
• \$$\mu_1, \ldots, \mu_{20} \$$ sufficiently spread out

What natural settings could this be from?
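A minimal sketch of this mixture density in Python; the means here are random placeholders, not the specific 20 means from the paper:

```python
import numpy as np

# Hypothetical, spread-out means; the paper uses a specific set of 20.
rng = np.random.default_rng(0)
mus = rng.uniform(0, 10, size=(20, 2))
sigma = 0.1             # sigma_1 = ... = sigma_20 = 0.1
w = np.full(20, 0.05)   # w_1 = ... = w_20 = 0.05, summing to 1

def f(x):
    """Density of the 20-component 2D Normal mixture."""
    x = np.asarray(x)
    sq = np.sum((x - mus) ** 2, axis=1)  # (x - mu_i)'(x - mu_i)
    comps = w / (2 * np.pi * sigma**2) * np.exp(-sq / (2 * sigma**2))
    return comps.sum()
```

With \$$\sigma_i = 0.1 \$$ and well-separated means, the modes are sharp and far apart, which is exactly what makes local MCMC moves get stuck.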

## Example: how it looks

## How do we detect multimodality?

• Plotting the posterior
• Run optim from random starting points
• Intuition/similar problems

• Slow convergence:
• Classic Metropolis or Metropolis-Hastings
• HMC is only marginally better
• Even if we can use EM, it only finds local modes.
• This is not too bad if we know what we are doing.

### And yet, we want to do it right

• How can we run full MCMC, converge fast, and easily run it in parallel?
• Full MCMC - not as good as exact sampling, but gives an adequate picture of the posterior surface.
• Converge fast - save computing resources and be able to set realistic deadlines.
• Easily run in parallel - stay fashionable and further increase speed.
• Introducing the Equi-Energy Sampler.

## The EE foundations

We (once again) turn to physics.

The distribution of a system in thermal equilibrium (Boltzmann distribution):

$$p_T(x) = { 1 \over Z(T)} \rr{exp} \left( -h(x) / T\right)$$

\$$h(x) \$$ is the energy function (or Hamiltonian), \$$T \$$ is temperature, \$$Z(T) = \sum_x \rr{exp} \left( -h(x) / T\right) \$$ is the partition function.
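A toy discrete system (with made-up energies) illustrates the Boltzmann distribution and the partition function:

```python
import numpy as np

# Toy discrete state space with assumed energies h(x) over 5 states.
h = np.array([0.0, 1.0, 1.0, 2.0, 3.0])

def boltzmann(h, T):
    """p_T(x) = exp(-h(x)/T) / Z(T), with Z(T) the partition function."""
    w = np.exp(-h / T)
    Z = w.sum()
    return w / Z, Z

p, Z = boltzmann(h, T=1.0)
# States 1 and 2 have equal energy, so they get equal probability:
# the conditional distribution on an equi-energy surface is uniform.
```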

## Boltzmann law implies

Conditional distribution of the system given energy \$$h(x) = u \$$ is uniform on the equi-energy surface \$$\{ x: h(x) = u \} \$$:

\begin{align} p(x \given h(x) = u) = {p_T(x) \over \sum_{x:h(x)=u} p_T(x) }\propto \rr{const}\end{align}

Let the infinitesimal volume of energy slice \$$\{ x:h(x) \in (u, u + du) \} \$$ be \$$\Omega(u) du \$$.

\$$\Omega(u) \$$ - density of states function.

### Lemma

Let \$$\beta = 1/T \$$.

$$\mu_g(\beta^{-1})Z(\beta^{-1}) = \int_0^\infty \nu_g(u) \Omega(u) e^{-\beta u} du$$

with thermal average \$$\mu_g(T) \$$

$$\mu_g(T) = \sum_x g(x) \rr{exp}(-h(x)/T) / Z(T)$$

and microcanonical average \$$\nu_g(u) \$$: \$$\nu_g(u) = \rr{E}(g(X) \given h(X) = u) \$$.
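The lemma follows by grouping the sum over states into energy shells: states with \$$h(x) \in (u, u + du) \$$ contribute their average \$$g \$$-value \$$\nu_g(u) \$$ times the shell volume \$$\Omega(u) du \$$ times the Boltzmann factor \$$e^{-\beta u} \$$:

```latex
\begin{align}
\mu_g(\beta^{-1})\, Z(\beta^{-1})
  &= \sum_x g(x)\, e^{-\beta h(x)} \\
  &= \int_0^\infty \rr{E}\left( g(X) \given h(X) = u \right) \Omega(u)\, e^{-\beta u}\, du \\
  &= \int_0^\infty \nu_g(u)\, \Omega(u)\, e^{-\beta u}\, du.
\end{align}
```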

## In particular

Let \$$g \equiv 1 \$$: then the partition function \$$Z(\beta^{-1}) \$$ and the density of states \$$\Omega(u) \$$ form a Laplace transform pair.

• The lemma suggests that we can use MCMC to estimate the density of states and microcanonical averages, and from them obtain Boltzmann averages and the partition function.

### Main idea of EE jumps

Sometimes move in the constant-energy space.

## EE sampler: definitions

• The goal is to sample from \$$\pi(x) \$$.
• \$$\pi(x) \$$ is presumably multimodal.
• Let \$$h(x) \$$ be the associated energy function: $$\pi(x) \propto \rr{exp}(-h(x))$$
• Sequence of energy levels $$H_0 < H_1 < H_2 < \ldots < H_K < H_{K+1} = \infty$$
• \$$H_0 \leq \rr{inf}_x h(x) \$$
• Associated sequence of temperatures $$1 = T_0 < T_1 < \ldots < T_K$$

## More definitions

• There are \$$K+1 \$$ distributions \$$\pi_i \$$, \$$0 \leq i \leq K \$$.
• Corresponding energy functions are $$h_i(x) = { 1 \over T_i } \left( h(x) \vee H_i\right)$$
• That is, $$\pi_i(x) \propto \rr{exp} (-h_i(x))$$
• This way \$$\pi_0 \$$ is the needed distribution \$$\pi \$$.
• Higher \$$i \$$ means more energy truncation and a higher temperature, so it is easier to move between local modes.
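The flattened energy functions can be sketched in Python; the 1D quadratic energy and the level/temperature ladder below are toy assumptions, not values from the paper:

```python
import numpy as np

def make_h_i(h, H, T):
    """Build the K+1 energy functions h_i(x) = (h(x) ∨ H_i) / T_i."""
    # Default args pin H_i, T_i to each lambda (avoids late binding).
    return [lambda x, Hi=Hi, Ti=Ti: np.maximum(h(x), Hi) / Ti
            for Hi, Ti in zip(H, T)]

# Toy example: quadratic energy with an assumed ladder (K = 2).
h = lambda x: 0.5 * x**2
H = [0.0, 2.0, 4.0]   # H_0 < H_1 < H_2
T = [1.0, 2.0, 4.0]   # T_0 = 1 < T_1 < T_2
h_i = make_h_i(h, H, T)
# h_0 recovers h itself, since H_0 <= inf_x h(x) and T_0 = 1.
```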

## Main EE feature

• The equi-energy jump helps sample from \$$\pi_i \$$ with small \$$i\$$.
• Equi-energy sets \$$S_i(H) = \{ x:h_i(x)=H \} \$$ are mutually consistent across \$$i \$$ above the truncation values.
• Construct empirical EE sets for \$$K, K-1, \ldots \$$
• The sets remain consistent going from energy/temperature \$$i \$$ to \$$i-1\$$
• Sample points at random from EE sets - avoid local modes.

### Equi-Energy rings

• State space \$$\mathcal{X} = \cup_{j=0}^K D_j \$$ with \$$D_j = \{ x: h(x) \in [H_j, H_{j+1}) \} \$$, \$$0 \leq j \leq K \$$.
• For \$$x \in \mathcal{X} \$$, let the partition index \$$I(x) = j \$$ if \$$x \in D_j \$$, i.e. \$$h(x) \in [H_j, H_{j+1}) \$$.
• For a chain running at energy/temperature \$$i \$$, the empirical versions of the EE sets \$$S_i(H) \$$ are the energy rings \$$\hat{D}_j^{(i)}, 0 \leq j \leq K \$$.
• \$$\hat{D}_j^{(i)} \$$ consists of the samples \$$X_n^{(i)} \$$ with \$$I(X_n^{(i)}) = j\$$.
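A minimal sketch of the partition index and the empirical energy rings (function names are mine; assumes \$$H_0 \leq \rr{inf}_x h(x) \$$):

```python
import numpy as np

def partition_index(hx, H):
    """I(x) = j when h(x) ∈ [H_j, H_{j+1}), with H_{K+1} = ∞."""
    return int(np.searchsorted(H, hx, side='right') - 1)

def energy_rings(samples, h, H):
    """Group chain samples into empirical energy rings D̂_j by energy."""
    rings = {j: [] for j in range(len(H))}
    for x in samples:
        rings[partition_index(h(x), H)].append(x)
    return rings
```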

### EE sampler: the algorithm

• Begin with an MH chain \$$X^{(K)} \$$ targeting highest-order \$$\pi_K\$$.
• After burn-in, make \$$K\$$th-order energy rings \$$\hat{D}_j^{(K)} \$$.
• After \$$N \$$ iterations, start chain \$$X^{(K-1)}\$$.
• Update \$$X^{(K-1)}\$$ with local MH move with probability \$$1- p_{ee} \$$, and with the EE move with probability \$$p_{ee} \$$.
• Construct and keep updating \$$\hat{D}_j^{(K-1)} \$$.
• And so on.
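The steps above can be sketched in Python. This is a simplified 1D version: chains run sequentially (each chain finishes before the next-lower one starts, rather than concurrently as in the paper), and all tuning values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def ee_sampler(h, H, T, n_iter=5000, burn=500, p_ee=0.1, step=1.0):
    """Simplified equi-energy sampler: K+1 chains from hottest (order K)
    down to the target (order 0), mixing local MH moves with EE jumps
    into the higher-order chain's energy rings."""
    K = len(H) - 1
    rings = [None] * (K + 1)   # rings[i][j] = chain i's samples in D_j
    chains = [None] * (K + 1)

    def idx(x):                # partition index I(x)
        return int(np.searchsorted(H, h(x), side='right') - 1)

    for i in range(K, -1, -1):
        hi = lambda x, i=i: max(h(x), H[i]) / T[i]   # h_i = (h ∨ H_i)/T_i
        x = rng.normal()
        samples, ring = [], [[] for _ in range(K + 1)]
        for n in range(n_iter):
            if i < K and rng.uniform() < p_ee and rings[i + 1][idx(x)]:
                # EE jump: propose uniformly from the matching ring above.
                pool = rings[i + 1][idx(x)]
                y = pool[rng.integers(len(pool))]
                hi1 = lambda z: max(h(z), H[i + 1]) / T[i + 1]
                logr = (hi(x) - hi(y)) + (hi1(y) - hi1(x))
            else:
                y = x + step * rng.normal()          # local MH move
                logr = hi(x) - hi(y)
            if np.log(rng.uniform()) < logr:
                x = y
            if n >= burn:
                samples.append(x)
                ring[idx(x)].append(x)
        chains[i], rings[i] = np.array(samples), ring
    return chains
```

Running it with the toy quadratic energy \$$h(x) = x^2/2 \$$, chain 0 should recover the standard Normal, since \$$h_0 = h \$$.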

## The EE jump

For the current state \$$X^{(i)}_n\$$, the EE jump proposes a new state \$$y \$$ chosen uniformly from the higher-order energy ring \$$\hat{D}_j^{(i+1)} \$$ with matching energy level: $$I(y) = I(X^{(i)}_n)$$

Accept with probability $$\rr{min} \left( 1, {\pi_{i}(y) \pi_{i+1}( X^{(i)}_n) \over \pi_{i}(X^{(i)}_n) \pi_{i+1}( y) }\right)$$
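The acceptance probability is conveniently computed in log space (a sketch; `h_i` and `h_ip1` stand for the energy functions of orders \$$i \$$ and \$$i+1 \$$):

```python
import numpy as np

def ee_accept_prob(x, y, h_i, h_ip1):
    """MH acceptance probability for an EE jump from x to y, using
    pi_i ∝ exp(-h_i): min(1, pi_i(y) pi_{i+1}(x) / (pi_i(x) pi_{i+1}(y)))."""
    log_ratio = (-h_i(y) - h_ip1(x)) - (-h_i(x) - h_ip1(y))
    return min(1.0, np.exp(log_ratio))
```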

## Connection to slice sampler

• Slice sampler has two steps:
1. Sample an auxiliary variable \$$U \sim \rr{Unif}[0, \pi(X)] \$$.
2. Sample \$$X \sim \rr{Unif} \{ x: \pi(x) \geq U\} \$$.
• In theory it looks good, but is it practical?
• Student discussion.
• Reference: wiki.
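The two steps can be sketched for a target whose slice is available in closed form, \$$\pi(x) \propto \rr{exp}(-x^2/2) \$$ (a toy example of mine, not from the slides; for general targets the slice must be found numerically, e.g. by stepping out):

```python
import numpy as np

def slice_sampler_normal(n, x0=0.0, seed=0):
    """Slice sampler for pi(x) ∝ exp(-x^2/2). Here the slice
    {x: pi(x) >= u} is the interval |x| <= sqrt(-2 log u),
    so step 2 can be done exactly."""
    rng = np.random.default_rng(seed)
    x, out = x0, []
    for _ in range(n):
        u = rng.uniform(0.0, np.exp(-x**2 / 2))  # step 1: auxiliary U
        half = np.sqrt(-2.0 * np.log(u))         # slice half-width
        x = rng.uniform(-half, half)             # step 2: uniform on slice
        out.append(x)
    return np.array(out)
```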

### Theorem

Under regularity conditions*, the chains \$$X^{(i)} \$$ are ergodic with \$$\pi_i \$$ as their steady-state distributions.

* (i) \$$X^{(K)} \$$ is irreducible and aperiodic, (ii) the MH transition kernel connects adjacent energy sets, (iii) the energy-set probabilities \$$p_j^{(i)} = P_{\pi_i}(X \in D_j) > 0 \$$ for all \$$i \$$ and \$$j \$$.

## Practical implementation of EE sampler

• Set energy levels in geometric progression, i.e. with \$$\rr{log}(H_{i+1} - H_i) \$$ evenly spaced.
• Set temperatures with \$$(H_{i+1} - H_i)/T_i \approx c \$$ with \$$c \in [1, 5] \$$.
• The number of chains/energies/temperatures \$$K \$$ can be roughly proportional to the dimensionality of \$$\pi \$$.
• Student discussion - what factors could drive the choice of \$$K \$$?
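One way to implement these guidelines; the growth factor `r` and the extrapolated top temperature are my assumptions, not prescriptions from the paper:

```python
import numpy as np

def energy_ladder(H0, r, K, c=2.0):
    """Energy levels in geometric progression H_i = H0 * r**i and
    temperatures with (H_{i+1} - H_i) / T_i ≈ c, T_0 fixed at 1.
    H0 > 0, r > 1, and c in [1, 5] are tuning choices."""
    H = H0 * r ** np.arange(K + 1)
    gaps = np.diff(H)              # H_{i+1} - H_i for i = 0..K-1
    T = np.empty(K + 1)
    T[0] = 1.0
    T[1:K] = gaps[1:] / c          # T_i = (H_{i+1} - H_i) / c
    T[K] = T[K - 1] * r            # extrapolate for the top chain
    return H, T
```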

## More practical settings

• EE jump probability \$$p_{ee} \in [5\%, 30\%]\$$ "often works quite well".
• Self adaptation of the MH-proposal step size. Connect the proposal jump to sampler acceptance rate.
• Student discussion - can we use HMC instead of MH here?
• Not guessing \$$H_0 = H_{\rr{min}} \$$ exactly is OK; it can be reconfigured while the sampler runs.

## Back to example

Two-dimensional Normal mixture:

\begin{align} & f(x) = \\ & \sum_{i=1}^{20} {w_i \over 2\pi \sigma_i^2} \rr{exp} \left\{ - {1 \over 2 \sigma_i^2} (x - \mu_i)'(x - \mu_i)\right\}\end{align}

Use EE sampler with \$$K = 4 \$$ (5 chains).

## EE sampler performance

(Figure: samples from the target, lowest-energy chain, vs. the other chains.)

## Samples in each energy ring

## EE sampler vs PT sampler

## Announcements

• T-shirt comp ends this Wednesday!
• Final projects.

## Final slide

• Next lecture: Approximate methods - Variational Inference