- Why we need even better MCMC
- Equi-energy (EE) sampler foundations
- Derivations
- EE in the context of other methods
- Is this the best we can do?

Reference: S. C. Kou, Qing Zhou and Wing Hung Wong. Equi-Energy Sampler with Applications in Statistical Inference and Statistical Mechanics. Annals of Statistics, 2006.

Two-dimensional Normal mixture:

$$\begin{align} & f(x) = \\ & \sum_{i=1}^{20} {w_i \over 2\pi \sigma_i^2} \rr{exp} \left\{ - {1 \over 2 \sigma_i^2} (x - \mu_i)'(x - \mu_i)\right\}\end{align}$$

- \\( \sigma_1 = \ldots = \sigma_{20} = 0.1 \\)
- \\( w_1 = \ldots = w_{20} = 0.05 \\)
- \\( \mu_1, \ldots, \mu_{20} \\) sufficiently spread out
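
A minimal sketch of this target in Python (the \\( \mu_i \\) here are hypothetical random draws; the paper fixes a specific spread-out set of 20 means):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical spread-out means on [0, 10]^2 (the paper uses a fixed list of 20).
mu = rng.uniform(0.0, 10.0, size=(20, 2))
sigma = 0.1                      # sigma_1 = ... = sigma_20
w = np.full(20, 0.05)            # w_1 = ... = w_20

def f(x):
    """Mixture density f(x) at a 2-D point x."""
    sq = np.sum((x - mu) ** 2, axis=1)      # ||x - mu_i||^2 for each component
    return float(np.sum(w / (2 * np.pi * sigma**2) * np.exp(-sq / (2 * sigma**2))))
```

With \\( \sigma_i = 0.1 \\) and well-separated means, essentially all mass sits in 20 narrow peaks, which is exactly what defeats local samplers.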

What natural settings could this be from?

- Plotting the posterior
- Run optim from random points
- Intuition/similar problems

- Slow convergence:
    - Classic Metropolis or Metropolis-Hastings
    - HMC is only marginally better
- If we can use EM, we can only find local modes.
    - This is not too bad if we know what we are doing.

- How can we do a full MCMC, converge fast, and easily run it in parallel?
    - Full MCMC - not as good as exact sampling, but gives an adequate picture of the posterior surface.
    - Converge fast - save computing resources and be able to set realistic deadlines.
    - Easily run in parallel - stay fashionable and further increase speed.

- Introducing the Equi-Energy Sampler.

We (once again) turn to physics.

The distribution of a system in thermal equilibrium (Boltzmann distribution):

$$p_T(x) = { 1 \over Z(T)} \rr{exp} \left( -h(x) / T\right)$$

\\( h(x) \\) is the energy function (or Hamiltonian), \\( T \\) is temperature, \\( Z(T) = \sum_x \rr{exp} \left( -h(x) / T\right) \\) is the partition function.
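
For intuition, a toy finite system (hypothetical energies) showing how temperature reshapes \\( p_T \\):

```python
import numpy as np

# Hypothetical energies h(x) for a five-state system.
h = np.array([0.0, 1.0, 2.0, 4.0, 8.0])

def boltzmann(h, T):
    """p_T(x) = exp(-h(x)/T) / Z(T) on a finite state space."""
    unnorm = np.exp(-h / T)
    return unnorm / unnorm.sum()     # normalizing by the partition function Z(T)

p_cold = boltzmann(h, T=0.5)   # mass concentrates on the lowest-energy state
p_hot = boltzmann(h, T=10.0)   # distribution flattens out
```

High temperature flattens the distribution; low temperature concentrates it on the low-energy states.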

Conditional distribution of the system given energy \\( h(x) = u \\) is uniform on the equi-energy surface \\( \{ x: h(x) = u \} \\):

$$\begin{align} p(x \given h(x) = u) = {p_T(x) \over \sum_{x:h(x)=u} p_T(x) }\propto \rr{const}\end{align}$$

Let the infinitesimal volume of energy slice \\( \{ x:h(x) \in (u, u + du) \} \\) be \\( \Omega(u) du \\).

\\( \Omega(u) \\) - density of states function.

Let \\( \beta = 1/T \\).

$$\mu_g(\beta^{-1})Z(\beta^{-1}) = \int_0^\infty \nu_g(u) \Omega(u) e^{-\beta u} du$$

with thermal average \\( \mu_g(T) \\)

$$\mu_g(T) = \sum_x g(x) \rr{exp}(-h(x)/T) / Z(T)$$

and microcanonical average \\( \nu_g(u) \\): \\( \nu_g(u) = \rr{E}(g(X) \given h(X) = u) \\).

With \\( g \equiv 1 \\), the partition function \\( Z(\beta^{-1}) \\) and the density of states \\( \Omega(u) \\) form a Laplace transform pair.
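
Writing out the special case: with \\( g \equiv 1 \\) we have \\( \nu_g \equiv 1 \\) and \\( \mu_g \equiv 1 \\), so the identity above reduces to

$$Z(\beta^{-1}) = \int_0^\infty \Omega(u) e^{-\beta u} du$$

i.e. \\( Z \\), viewed as a function of \\( \beta \\), is the Laplace transform of \\( \Omega \\).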

- The lemma suggests that we can use MCMC to estimate the density of states and the microcanonical averages, and from them obtain Boltzmann averages and the partition function.

Key idea: let the sampler sometimes move within a constant-energy slice of the state space.

- The goal is to sample from \\( \pi(x) \\).
- \\( \pi(x) \\) is presumably multimodal.
- Let \\( h(x) \\) be the associated energy function: $$\pi(x) \propto \rr{exp}(-h(x))$$
- Sequence of energy levels $$H_0 < H_1 < H_2 < \ldots < H_K < H_{K+1} = \infty$$
- \\( H_0 \leq \rr{inf}_x h(x) \\)
- Associated sequence of temperatures $$1 = T_0 < T_1 < \ldots < T_K$$

- There are \\( K+1 \\) distributions \\( \pi_i \\), \\( 0 \leq i \leq K \\).
- Corresponding energy functions are $$h_i(x) = { 1 \over T_i } \left( h(x) \vee H_i\right)$$
- That is, $$\pi_i(x) \propto \rr{exp} (-h_i(x))$$
- This way \\( \pi_0 \\) is the needed distribution \\( \pi \\).
- High \\( i \\) means energy truncation and high temperature, so it is easier to move between local modes.
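
A sketch of the truncated, tempered energies in Python (the base energy and both ladders are illustrative, not the paper's):

```python
import numpy as np

def h(x):
    """Hypothetical base energy: a 1-D double well, h(x) = (x^2 - 1)^2."""
    return (x**2 - 1.0) ** 2

H = np.array([0.0, 1.0, 2.5, 5.0])   # energy levels H_0 < H_1 < ... (illustrative)
T = np.array([1.0, 2.0, 4.0, 8.0])   # temperatures, T_0 = 1

def h_i(x, i):
    """h_i(x) = max(h(x), H_i) / T_i, so that pi_i(x) ∝ exp(-h_i(x))."""
    return np.maximum(h(x), H[i]) / T[i]
```

Since \\( H_0 \leq \rr{inf}_x h(x) \\) and \\( T_0 = 1 \\), we get \\( h_0 = h \\) and \\( \pi_0 = \pi \\); at \\( i = 2 \\) the barrier at \\( x = 0 \\) shrinks from \\( h_0(0) = 1 \\) to \\( h_2(0) = 0.625 \\).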

- The equi-energy jump helps sample from \\(\pi_i \\) with small \\( i\\).
- Equi-energy sets \\( S_i(H) = \{ x:h_i(x)=H \} \\) are mutually consistent across \\( i \\) above the truncation values.
- Construct empirical EE sets for \\( K, K-1, \ldots \\)
- The sets remain consistent going from energy/temperature \\( i \\) to \\( i-1\\)
- Sample points at random from EE sets - avoid local modes.

- State space \\( \mathcal{X} = \cup_{j=0}^K D_j \\) with \\( D_j = \{ x:h(x) \in [H_j, H_{j+1}) \} \\), \\( 0 \leq j \leq K \\).
- For \\( x \in \mathcal{X} \\), let the partition index \\( I(x) = j \\) if \\( x \in D_j \\), i.e. \\( h(x) \in [H_j, H_{j+1}) \\).
- For a chain running at energy/temperature \\( i \\), the empirical version of EE sets \\( S_i(H) \\) are EE energy rings \\( \hat{D}_j^{(i)}, 0 \leq j \leq K \\).
- \\( \hat{D}_j^{(i)} \\) consists of samples \\( X_n^{(i)} \\) with \\( I(X_n^{(i)}) = j \\).
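
A minimal sketch of the partition index and energy rings (illustrative levels; a real implementation keeps one set of rings per chain):

```python
import bisect

H = [0.0, 1.0, 2.5, 5.0]   # energy levels H_0 < ... < H_K; H_{K+1} = infinity

def partition_index(energy):
    """I(x) = j when h(x) falls in [H_j, H_{j+1}); assumes energy >= H[0]."""
    return bisect.bisect_right(H, energy) - 1

# Energy rings for one chain: rings[j] holds visited states x with I(x) = j.
rings = [[] for _ in H]

def record(x, energy):
    rings[partition_index(energy)].append(x)
```

The EE jump then draws its candidate uniformly from the ring with the same index in the chain one level up.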

- Begin with an MH chain \\( X^{(K)} \\) targeting highest-order \\(\pi_K\\).
- After burn-in, make \\( K\\)th-order energy rings \\( \hat{D}_j^{(K)} \\).
- After \\( N \\) iterations, start chain \\( X^{(K-1)}\\).
- Update \\( X^{(K-1)}\\) with local MH move with probability \\( 1- p_{ee} \\), and with the EE move with probability \\( p_{ee} \\).
- Construct and keep updating \\( \hat{D}_j^{(K-1)} \\).
- And so on.

For state \\( X^{(i)}_n\\), the candidate \\( y \\) for the EE jump is chosen uniformly from the higher-order energy ring \\( \hat{D}_j^{(i+1)} \\) with matching energy level, $$I(y) = I(X^{(i)}_n)$$

Accept with probability $$\rr{min} \left( 1, {\pi_{i}(y) \pi_{i+1}( X^{(i)}_n) \over \pi_{i}(X^{(i)}_n) \pi_{i+1}( y) }\right)$$
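
The ratio is easiest to compute in log space, where all normalizing constants cancel; a sketch (the helper names and ladders are assumptions):

```python
import numpy as np

def ee_accept_prob(y, x, h, H, T, i):
    """Probability of accepting an EE jump from x to candidate y at level i:
    min(1, pi_i(y) pi_{i+1}(x) / (pi_i(x) pi_{i+1}(y))),
    with pi_i(x) ∝ exp(-h_i(x)) and h_i(x) = max(h(x), H[i]) / T[i]."""
    def h_lvl(z, lvl):
        return max(h(z), H[lvl]) / T[lvl]
    log_ratio = (h_lvl(x, i) - h_lvl(y, i)) - (h_lvl(x, i + 1) - h_lvl(y, i + 1))
    return min(1.0, np.exp(log_ratio))
```

Because \\( I(y) = I(X^{(i)}_n) \\), the two states have comparable energy and the jump is accepted with high probability even when they sit in different modes.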

- Slice sampler has two steps:
- Sample auxiliary variable \\( U \sim \rr{Unif}[0, \pi(X)] \\).
- Sample \\( X \sim \rr{Unif} \{ x: \pi(x) \geq U\} \\).
- In theory it looks good, but is it practical?
- Student discussion.
- Reference: wiki.
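
A minimal slice sampler for a target where the slice \\( \{x : \pi(x) \geq u\} \\) has closed form (a standard normal, as an illustration); in general, finding the slice is the hard part:

```python
import numpy as np

rng = np.random.default_rng(1)

def pi(x):
    """Unnormalized standard normal density."""
    return np.exp(-0.5 * x**2)

def slice_step(x):
    """One slice-sampler update: U | X, then X | U uniform on the slice."""
    u = rng.uniform(0.0, pi(x))       # auxiliary variable U ~ Unif[0, pi(X)]
    a = np.sqrt(-2.0 * np.log(u))     # pi(x) >= u  <=>  |x| <= a
    return rng.uniform(-a, a)         # X ~ Unif{x : pi(x) >= u}

x, samples = 0.0, []
for _ in range(20000):
    x = slice_step(x)
    samples.append(x)
```

For multimodal targets the slice is a union of disjoint intervals, which is exactly where the practicality question bites.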

*Theorem*

Under regularity conditions*, \\( X^{(i)} \\) are ergodic with \\( \pi_i \\) as steady-state distributions.

* (i) \\( X^{(K)} \\) is irreducible and aperiodic, (ii) the MH transition kernel connects adjacent energy sets, (iii) energy set probabilities \\( p_j^{(i)} = P_{\pi_i}(X \in D_j) > 0 \\) for all \\( i \\) and \\( j \\).

- Set energy levels by geometric progression, i.e. \\( \rr{log}(H_{i+1} - H_i) \\) evenly spaced.
- Set temperatures with \\( (H_{i+1} - H_i)/T_i \approx c \\) with \\( c \in [1, 5] \\).
- Number of chains/energies/temperatures \\( K \\) can be roughly proportional to the dimensionality of \\( \pi \\).
- Student discussion - what factors could drive the choice of \\( K \\)?
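
One possible construction of both ladders (the gap ratio is an assumption; the paper only asks for geometric growth and \\( (H_{i+1}-H_i)/T_i \approx c \\)):

```python
import numpy as np

def ladders(H0, K, c=2.0, ratio=2.0):
    """Energy levels H_0, ..., H_K with geometrically growing gaps and
    temperatures satisfying (H_{i+1} - H_i) / T_i = c exactly.
    Setting the first gap to c makes T_0 = 1; ratio=2 is an assumption."""
    gaps = c * ratio ** np.arange(K + 1)   # gap above H_K treated the same way
    H = H0 + np.concatenate(([0.0], np.cumsum(gaps[:K])))
    T = gaps / c                           # T_0 = 1 < T_1 < ... < T_K
    return H, T

H, T = ladders(H0=0.0, K=4)
```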

- EE jump probability \\( p_{ee} \in [5\%, 30\%]\\) "often works quite well".
- Self adaptation of the MH-proposal step size. Connect the proposal jump to sampler acceptance rate.
- Student discussion - can we use HMC instead of MH here?
- It is OK not to guess \\( H_0 = H_{\rr{min}} \\) exactly; it can be reconfigured while the sampler runs.

Two-dimensional Normal mixture:

$$\begin{align} & f(x) = \\ & \sum_{i=1}^{20} {w_i \over 2\pi \sigma_i^2} \rr{exp} \left\{ - {1 \over 2 \sigma_i^2} (x - \mu_i)'(x - \mu_i)\right\}\end{align}$$

Use EE sampler with \\( K = 4 \\) (5 chains).

Target (lowest energy chain):

Other chains

- There are a number of MPI implementations.
- Student discussion - how would you implement it?
- Specifically, how to work with inter-worker communication?

- T-shirt comp ends this Wednesday!
- Final projects.

- Slides nesterko.com/lectures/stat221-2012/lecture20
- Class website theory.info/harvardstat221
- Class Piazza piazza.com/class#spring2013/stat221
- Class Twitter twitter.com/harvardstat221

- Next lecture: Approximate methods - Variational Inference