Roadmap
- Why we need to do even better MCMC
- Equi-energy (EE) sampler foundations
- Derivations
- EE in the context of other methods
- Is this the best we can do?
Where we are
Why we need to do better
Reference: S. C. Kou, Qing Zhou and Wing Hung Wong (2006). Equi-Energy Sampler
with Applications in Statistical Inference and Statistical Mechanics.
The Annals of Statistics, 34(4):1581-1619.
Two-dimensional Normal mixture:
$$f(x) = \sum_{i=1}^{20} {w_i \over 2\pi \sigma_i^2} \rr{exp} \left\{ - {1 \over 2 \sigma_i^2} (x - \mu_i)'(x - \mu_i)\right\}$$
Details
- \\( \sigma_1 = \ldots = \sigma_{20} = 0.1 \\)
- \\( w_1 = \ldots = w_{20} = 0.05 \\)
- \\( \mu_1, \ldots, \mu_{20} \\) sufficiently spread out
What natural settings could this be from?
Example: how it looks
How do we detect multimodality?
- Plotting the posterior
- Run `optim` (or any local optimizer) from random starting points - see the sketch after this list
- Intuition/similar problems
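A minimal sketch of the multi-start idea (hypothetical: the means `mu` below are placeholders, not the paper's values, and `scipy.optimize.minimize` stands in for R's `optim`):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
mu = rng.uniform(0, 10, size=(20, 2))    # placeholder means; the paper spreads them out
sigma, w = 0.1, 0.05                     # common scale and weight

def neg_log_f(x):
    # negative log of the 20-component Normal mixture density
    sq = np.sum((x - mu) ** 2, axis=1)
    dens = np.sum(w / (2 * np.pi * sigma ** 2) * np.exp(-sq / (2 * sigma ** 2)))
    return -np.log(dens + 1e-300)        # guard against log(0) far from all modes

modes = []
for _ in range(200):                     # many random restarts
    x0 = rng.uniform(0, 10, size=2)
    res = minimize(neg_log_f, x0, method="Nelder-Mead")
    if not any(np.linalg.norm(res.x - m) < 0.2 for m in modes):
        modes.append(res.x)
print(f"found {len(modes)} distinct local modes")
```

Finding many well-separated modes this way is a strong hint that a single local-move sampler will mix poorly.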
Traditional methods
- Slow convergence:
- Classic Metropolis or Metropolis-Hastings
- HMC is only marginally better
- Even if we can use EM, it only finds local modes.
- This is not too bad if we know what we are doing.
And yet, we want to do it right
- How can we run a full MCMC that converges fast and is easy
to run in parallel?
- Full MCMC - not as good as exact sampling, but
gives adequate picture of posterior surface.
- Converge fast - save computing resources and
be able to set realistic deadlines.
- Easily run it in parallel - stay fashionable
and further increase speed.
- Introducing the Equi-Energy Sampler.
The EE foundations
We (once again) turn to physics.
The distribution of a system in thermal equilibrium
(Boltzmann distribution):
$$p_T(x) = { 1 \over Z(T)} \rr{exp} \left( -h(x) / T\right)$$
\\( h(x) \\) is the energy function (or Hamiltonian), \\( T \\)
is temperature, \\( Z(T) = \sum_x \rr{exp} \left( -h(x) / T\right) \\)
is the partition function.
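A toy illustration of these definitions on a small discrete state space (the energy function `h` below is made up for illustration):

```python
import numpy as np

states = np.arange(10)                 # toy discrete state space
h = (states - 3.0) ** 2 / 5.0          # made-up energy function h(x)

def boltzmann(T):
    weights = np.exp(-h / T)
    Z = weights.sum()                  # partition function Z(T)
    return weights / Z                 # p_T(x)

print(boltzmann(1.0))                  # low T: mass concentrates at the energy minimum
print(boltzmann(10.0))                 # high T: distribution is close to uniform
```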
Boltzmann law implies
Conditional distribution of the system given energy \\( h(x) = u \\)
is uniform on the equi-energy surface \\( \{ x: h(x) = u \} \\):
$$p(x \given h(x) = u) = {p_T(x) \over \sum_{x:h(x)=u} p_T(x) } = \rr{const},$$
since \\( p_T(x) = e^{-u/T} / Z(T) \\) is the same for every \\( x \\) on the surface.
Let the infinitesimal volume of energy slice \\( \{ x:h(x) \in (u, u + du) \} \\) be
\\( \Omega(u) du \\).
\\( \Omega(u) \\) is called the density-of-states function.
Lemma
Let \\( \beta = 1/T \\).
$$\mu_g(\beta^{-1})Z(\beta^{-1}) = \int_0^\infty \nu_g(u) \Omega(u) e^{-\beta u} du$$
with thermal average \\( \mu_g(T) \\)
$$\mu_g(T) = \sum_x g(x) \rr{exp}(-h(x)/T) / Z(T)$$
and microcanonical average \\( \nu_g(u) \\): \\( \nu_g(u) = \rr{E}(g(X) \given h(X) = u) \\).
In particular
Let \\( g \equiv 1 \\): then the partition function \\( Z(\beta^{-1}) \\) and the density of
states \\( \Omega(u) \\) form a Laplace-transform pair.
- The lemma suggests that we can use MCMC to estimate the density of states and
microcanonical averages, and from them obtain Boltzmann averages and the partition
function.
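For intuition, a toy numerical check of the \\( g \equiv 1 \\) case on a discrete system (the energies are made up): summing \\( e^{-\beta h(x)} \\) over states directly agrees with weighting a histogram estimate of \\( \Omega(u) \\) by \\( e^{-\beta u} \\).

```python
import numpy as np

rng = np.random.default_rng(2)
h_vals = rng.uniform(0, 5, size=1000)        # made-up energies of 1000 states

def Z_direct(beta):
    return np.exp(-beta * h_vals).sum()      # Z(1/beta) summed state by state

# histogram the energies -> discretized density of states Omega(u) du
counts, edges = np.histogram(h_vals, bins=50)
u_mid = 0.5 * (edges[:-1] + edges[1:])

def Z_via_dos(beta):
    # Laplace-transform view: Z ~ sum_u Omega(u) e^{-beta u} du
    return np.sum(counts * np.exp(-beta * u_mid))

print(Z_direct(1.0), Z_via_dos(1.0))         # agree up to binning error
```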
Main idea of EE jumps
From time to time, jump within a set of (near-)constant energy instead of moving locally.
EE sampler: definitions
- The goal is to sample from \\( \pi(x) \\).
- \\( \pi(x) \\) is presumably multimodal.
- Let \\( h(x) \\) be the associated energy function:
$$\pi(x) \propto \rr{exp}(-h(x))$$
- Sequence of energy levels
$$H_0 < H_1 < H_2 < \ldots < H_K < H_{K+1} = \infty$$
- \\( H_0 \leq \rr{inf}_x h(x) \\)
- Associated sequence of temperatures
$$1 = T_0 < T_1 < \ldots < T_K$$
More definitions
- There are \\( K+1 \\) distributions \\( \pi_i \\), \\( 0 \leq i \leq K \\).
- Corresponding energy functions are
$$h_i(x) = { 1 \over T_i } \left( h(x) \vee H_i\right)$$
- That is,
$$\pi_i(x) \propto \rr{exp} (-h_i(x))$$
- This way \\( \pi_0 \\) is the needed distribution \\( \pi \\).
- Higher \\( i \\) means stronger energy truncation and a higher temperature,
so it is easier to move between local modes.
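A minimal sketch of these flattened distributions (the ladder and temperature values below are placeholders, not from the paper):

```python
import numpy as np

H = np.array([0.0, 2.0, 4.0, 8.0, 16.0])   # placeholder energy ladder H_0 < ... < H_K
T = np.array([1.0, 1.5, 2.5, 4.0, 7.0])    # associated temperatures, T_0 = 1

def h_i(h_x, i):
    # energy-truncated, tempered energy: h_i(x) = (h(x) v H_i) / T_i
    return np.maximum(h_x, H[i]) / T[i]

# pi_i(x) ~ exp(-h_i(x)); for i = 0 this reduces to the target pi itself
```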
Main EE feature
- The equi-energy jump helps sample from \\(\pi_i \\) with
small \\( i\\).
- Equi-energy sets \\( S_i(H) = \{ x:h_i(x)=H \} \\) are
mutually consistent across \\( i \\) above the truncation
values.
- Construct empirical EE sets for \\( K, K-1, \ldots \\)
- The sets remain consistent going from energy/temperature \\( i \\)
to \\( i-1\\)
- Sample points at random from EE sets - avoid local modes.
Equi-Energy rings
- State space \\( \mathcal{X} = \cup_{j=0}^K D_j \\) with
\\( D_j = \{ x : h(x) \in [H_j, H_{j+1}) \} \\),
\\( 0 \leq j \leq K \\).
- For \\( x \in \mathcal{X} \\), let the partition index \\( I(x) = j \\) if
\\( x \in D_j \\), i.e. \\( h(x) \in [H_j, H_{j+1}) \\).
- For a chain running at energy/temperature level \\( i \\), the empirical versions
of the EE sets \\( S_i(H) \\) are the energy rings
\\( \hat{D}_j^{(i)}, 0 \leq j \leq K \\).
- \\( \hat{D}_j^{(i)} \\) consists of the samples \\( X_n^{(i)} \\) with
\\( I(X_n^{(i)}) = j \\).
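The partition index \\( I(x) \\) is just a lookup into the ladder; a minimal version (ladder values are placeholders):

```python
import numpy as np

H = np.array([0.0, 2.0, 4.0, 8.0, 16.0])  # H_0 < H_1 < ... < H_K (placeholder values)

def ring_index(h_x):
    # I(x) = j  iff  h(x) in [H_j, H_{j+1}), with H_{K+1} = infinity;
    # assumes H_0 <= inf h(x), so the result is always in {0, ..., K}
    return int(np.searchsorted(H, h_x, side="right") - 1)
```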
EE sampler: the algorithm
- Begin with an MH chain \\( X^{(K)} \\) targeting the highest-order \\(\pi_K\\).
- After burn-in, construct the \\( K\\)th-order energy rings \\( \hat{D}_j^{(K)} \\).
- After \\( N \\) iterations, start the chain \\( X^{(K-1)}\\).
- Update \\( X^{(K-1)}\\) with a local MH move with probability
\\( 1- p_{ee} \\), and with the EE move with probability
\\( p_{ee} \\).
- Construct and keep updating \\( \hat{D}_j^{(K-1)} \\).
- And so on, down to chain \\( X^{(0)} \\) targeting \\( \pi_0 = \pi \\) (see the sketch below).
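A self-contained minimal sketch of one update step (the toy target, ladder, temperatures, and \\( p_{ee} \\) are placeholder choices; the paper's staggered starts and burn-in handling are elided):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
H = np.array([0.0, 2.0, 4.0, 8.0, 16.0])   # placeholder ladder H_0 < ... < H_K
T = np.array([1.0, 1.5, 2.5, 4.0, 7.0])    # temperatures, T_0 = 1
K, p_ee = len(H) - 1, 0.1

def h(x):                                   # toy target energy (standard Normal)
    return 0.5 * float(x @ x)

def h_i(x, i):                              # truncated, tempered energy
    return max(h(x), H[i]) / T[i]

def ring(x):                                # partition index I(x)
    return int(np.searchsorted(H, h(x), side="right") - 1)

def ee_update(x, i, rings):
    """One update of chain i; rings[i+1][j] holds samples of chain i+1 in ring j."""
    j = ring(x)
    if i < K and rng.random() < p_ee and rings[i + 1][j]:
        # EE jump: uniform draw from the matching higher-order ring
        y = rings[i + 1][j][rng.integers(len(rings[i + 1][j]))]
        log_a = h_i(x, i) - h_i(y, i) + h_i(y, i + 1) - h_i(x, i + 1)
        if math.log(rng.random()) < log_a:
            x = y
    else:                                   # local random-walk MH move
        y = x + rng.normal(size=x.shape)
        if math.log(rng.random()) < h_i(x, i) - h_i(y, i):
            x = y
    rings[i][ring(x)].append(x.copy())      # keep chain i's rings current
    return x

# one ring list per chain, one bucket per energy level
rings = [[[] for _ in range(K + 1)] for _ in range(K + 1)]
x = rng.normal(size=2)
for n in range(1000):
    x = ee_update(x, K, rings)              # in practice, chains start staggered
```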
The EE jump
For the EE jump from state \\( X^{(i)}_n\\), the candidate \\( y \\) is chosen
uniformly from the higher-order energy ring
\\( \hat{D}_j^{(i+1)} \\) with the matching ring index, so that the energies are similar:
$$I(y) = I(X^{(i)}_n)$$
Accept with
probability
$$\rr{min} \left( 1, {\pi_{i}(y) \pi_{i+1}( X^{(i)}_n) \over \pi_{i}(X^{(i)}_n) \pi_{i+1}( y) }\right)$$
Connection to slice sampler
- Slice sampler has two steps:
- Sample auxiliary variable \\( U \sim \rr{Unif}[0, \pi(X)] \\).
- Sample \\( X \sim \rr{Unif} \{ x: \pi(x) \geq U\} \\).
- In theory it looks good, but is it practical?
- Student discussion.
- Reference: Wikipedia, "Slice sampling".
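A minimal 1-D sketch of those two steps, using Neal-style stepping-out and shrinkage to approximate step 2 (the target below is a toy, and the width `w` is a tuning assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

def pi_unnorm(x):                     # toy unnormalized target
    return np.exp(-0.5 * x ** 2) * (1 + np.sin(3 * x) ** 2)

def slice_step(x, w=1.0):
    u = rng.uniform(0, pi_unnorm(x))  # step 1: auxiliary height under the curve
    # step 2: sample uniformly from {x : pi(x) >= u}
    left = x - w * rng.random()       # grow an interval until it brackets the slice
    right = left + w
    while pi_unnorm(left) > u:
        left -= w
    while pi_unnorm(right) > u:
        right += w
    while True:                       # shrink toward x until a point lands in the slice
        y = rng.uniform(left, right)
        if pi_unnorm(y) > u:
            return y
        if y < x:
            left = y
        else:
            right = y

xs = [0.0]
for _ in range(5000):
    xs.append(slice_step(xs[-1]))
```

The practical catch, which motivates the discussion: sampling uniformly from \\( \{x : \pi(x) \geq U\} \\) exactly is as hard as the original problem in high dimensions.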
Steady-state distribution
Theorem
Under regularity conditions*, the chains \\( X^{(i)} \\) are ergodic with \\( \pi_i \\)
as their steady-state distributions.
* (i) \\( X^{(K)} \\) is irreducible and aperiodic; (ii) the MH transition
kernel connects adjacent energy sets; (iii) the energy-set probabilities
\\( p_j^{(i)} = P_{\pi_i}(X \in D_j) > 0 \\) for all \\( i \\) and
\\( j \\).
Practical implementation of EE sampler
- Set the energy levels in a geometric progression, i.e. with
\\( \rr{log}(H_{i+1} - H_i) \\) evenly spaced (see the sketch after this list).
- Set temperatures with \\( (H_{i+1} - H_i)/T_i \approx c \\) with
\\( c \in [1, 5] \\).
- The number of chains/energies/temperatures \\( K \\) can be roughly
proportional to the dimensionality of \\( \pi \\).
- Student discussion - what factors could drive the choice of
\\( K \\)?
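A minimal sketch of this recipe (the gap range, \\( c \\), \\( H_0 = 0 \\), and the handling of the boundary cases are all placeholder choices):

```python
import numpy as np

K, c = 4, 1.5                                 # placeholders: 5 chains, target ratio c
gaps = np.geomspace(1.0, 8.0, num=K)          # log(H_{i+1} - H_i) evenly spaced
H = np.concatenate(([0.0], np.cumsum(gaps)))  # H_0 = 0 here; use an estimate of min h(x)

# temperatures: T_0 = 1 and (H_{i+1} - H_i) / T_i ~ c, resolving the
# boundary cases (T_0 fixed, H_{K+1} = infinity) ad hoc
T = np.ones(K + 1)
for i in range(1, K + 1):
    g = gaps[i] if i < K else 1.5 * gaps[-1]  # no gap above H_K; scale the last one
    T[i] = max(T[i - 1] * 1.01, g / c)        # keep the ladder strictly increasing
print(H, T)
```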
More practical settings
- EE jump probability \\( p_{ee} \in [5\%, 30\%]\\)
"often works quite well".
- Self-adaptation of the MH proposal step size: tie the proposal
scale to the sampler's acceptance rate (see the sketch after this list).
- Student discussion - can we use HMC instead of MH here?
- Not guessing \\( H_0 = H_{\rr{min}} \\) correctly is OK; the levels can be
reconfigured while the sampler runs.
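One common way to implement that self-adaptation (a Robbins-Monro style rule; the target rate 0.3 and the diminishing gain are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)

def adaptive_mh_step(x, log_target, step, n, target_rate=0.3):
    """One random-walk MH move that also nudges the proposal scale."""
    y = x + step * rng.normal(size=np.shape(x))
    accept = np.log(rng.random()) < log_target(y) - log_target(x)
    # grow the scale after accepts, shrink after rejects, with a diminishing gain
    step *= np.exp((float(accept) - target_rate) / np.sqrt(n + 1))
    return (y if accept else x), step

# usage: x, step = adaptive_mh_step(x, lambda z: -0.5 * z @ z, step, n)
```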
Back to example
Two-dimensional Normal mixture:
$$f(x) = \sum_{i=1}^{20} {w_i \over 2\pi \sigma_i^2} \rr{exp} \left\{ - {1 \over 2 \sigma_i^2} (x - \mu_i)'(x - \mu_i)\right\}$$
Use EE sampler with \\( K = 4 \\) (5 chains).
EE sampler performance
Target (lowest energy chain):
Other chains
Samples in each energy ring
EE sampler vs. parallel tempering (PT) sampler
Implementing EE sampler with MPI
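The paper does not prescribe an implementation; one natural MPI layout (a sketch using `mpi4py`, with the toy target, ladder, and messaging scheme all assumptions) runs one rank per chain and streams samples downstream, so each chain builds its rings from the hotter chain's output. For brevity all chains start at once instead of staggered, and shutdown handling is only gestured at.

```python
# run as: mpirun -n 5 python ee_mpi.py   (one rank per chain)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
i = comm.Get_rank()                                 # rank i runs chain X^(i)
K = comm.Get_size() - 1
rng = np.random.default_rng(i)

H = np.linspace(0.0, 8.0, K + 1)                    # placeholder energy ladder
T = np.maximum(1.0, np.arange(K + 1, dtype=float))  # placeholder temperatures

def h(x):                                           # toy target energy
    return 0.5 * float(x @ x)

def h_i(x, lvl):
    return max(h(x), H[lvl]) / T[lvl]

def ring(x):
    return int(np.searchsorted(H, h(x), side="right") - 1)

rings = [[] for _ in range(K + 1)]                  # samples from chain i+1, by ring
x = rng.normal(size=2)

for n in range(10_000):
    j = ring(x)
    if i < K and rings[j] and rng.random() < 0.1:   # EE jump from the hotter chain
        y = rings[j][rng.integers(len(rings[j]))]
        log_a = h_i(x, i) - h_i(y, i) + h_i(y, i + 1) - h_i(x, i + 1)
        if np.log(rng.random()) < log_a:
            x = y
    else:                                           # local random-walk MH move
        y = x + rng.normal(size=2)
        if np.log(rng.random()) < h_i(x, i) - h_i(y, i):
            x = y
    if i > 0 and n % 10 == 0:                       # stream samples downstream
        comm.send((ring(x), x), dest=i - 1, tag=7)
    while i < K and comm.Iprobe(source=i + 1, tag=7):
        j, y = comm.recv(source=i + 1, tag=7)
        rings[j].append(y)

while i < K and comm.Iprobe(source=i + 1, tag=7):   # drain; real code needs a
    comm.recv(source=i + 1, tag=7)                  # proper shutdown protocol
comm.Barrier()
```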
Announcements
- T-shirt comp ends this Wednesday!
- Final projects.
Final slide
- Next lecture: Approximate methods - Variational Inference