Roadmap
- Multimodality as a challenge for MCMC
- MCMC diagnostics
- Parallel Tempering algorithm
- General structure of a PT program
Where we are
Multimodality
- It is a very common feature in likelihood/posterior surfaces.
- What causes it?
- Non-identifiabilities.
- Overfitting.
- Multiple likely parameter configurations.
- To extract all information from the posterior surface, need
to be able to effectively sample from it.
Example
Now more complex
- Latent class regression as in the problem set: \\( 24 \times \rr{nclasses} \\)
regression parameters.
- Protein folding.
- Any complex system + imperfect modeling.
Areas of open research
- Overfitting.
- Model selection.
- Questions:
- How to disambiguate natural versus modeler-imposed multimodality?
- What metrics should we use to know that a given model is sufficient?
- In a world with perfect modelers, would MLE estimation suffice?
But for now
- Need to be able to sample from posterior surfaces to get an answer.
Strategies
- "Running your MCMC long enough"
- Start with random points, and:
- MCMC
- Point estimation + curvature
- Simulated Tempering
- Parallel Tempering
- Equi-energy Sampler
Why naive MCMC isn't good
- We don't know how long is "long enough". In addition to visual inspection,
there are several convergence diagnostics:
- Gelman-Rubin
- Geweke
- Raftery-Lewis
- Often, need to run prohibitively long to achieve acceptable mixing.
Gelman-Rubin diagnostic
- First, need to run multiple chains, for example, \\( m \\)
chains.
- Compute within-chain variance estimator for parameter \\( \theta \\)
$$W = {1 \over m } \sum_{j=1}^m s_j^2$$
with
$$s_j^2 = {1 \over n-1} \sum_{i=1}^n(\theta_{ij} - \bar{\theta}_{j})^2$$
- Likely, \\( W \\) underestimates the true variance.
Across chains
- Compute between chain variance estimate
$$B = { n \over m-1} \sum_{j=1}^m (\bar{\theta}_j - \bar{\bar{\theta}})^2$$
with $$\bar{\bar{\theta}} = {1 \over m} \sum_{j=1}^m \bar{\theta}_j$$
- This is variance of each chain average multimplied by \\( n \\)
- the number of draws in each chain (why?)
Viariance of stationary distribution
- Based on \\( W \\) and \\( B \\), estimate variance of
stationary distribution
$$\hat{\rr{Var}} \theta = \left( 1 - {1 \over n}\right) W + {1 \over n} B$$
- Because of overdispersion in starting values, this is likely
to overestimate the true variance.
- The estimator is unbiased if starting values come
from the stationary distribution.
Scale reduction factor
Compute
$$\hat{R} = \sqrt{\hat{\rr{Var}} \theta \over W}$$
If \\( \hat{R} \\) is high, we should run chains longer
to improve convergence to the stationary distribution.
- For this diagnostic, need to run multiple chains.
- Which diagnostic is better - Geweke or Gelman-Rubin?
Gelman-Rubin diagnostic code
mh.draws1 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws2 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws3 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws4 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws5 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.list <- mcmc.list(list(mh.draws1, mh.draws2, mh.draws3, mh.draws4,
+ mh.draws5))
gelman.diag(mh.list)
Use
gelman.plot(mh.list)
Starting with random points
- Very practical.
- But there isn't much guarantee that we explore the whole space.
- A good starting point for a more substantial algorithm.
Simulated Tempering
- Data augmentation! Ideas borrowed from physics.
- Every density function can be rewritten in this way:
$$f(x) = e^{-(-\rr{log}f(x))}$$
- Define energy \\( E(x) = -\rr{log}f(x)\\)
- Aim to simulate from
$$f(x, T) = e^{- {1 \over T} E(x)}$$
- \\( T \in \{ T_1, T_2, \ldots, T_m\}\\) - pre-defined set of
temperature levels (presumably including temperature 1).
Properties
- Sample from the joint temperature-parameter space.
- Only use the draws from the parameter space with temperature 1.
- Will we ever converge in the temperature space?
- Can this algorithm be parallelized?
Parallel Tempering
- Aims to use the ideas of simulated tempering.
- Amenable to (designed for) parallelization.
- Idea: run chains at different temperatures in parallel,
and "swap" parameter values between them once in a while.
- Goal: improve mixing at temperature 1.
Definitions
Define composite system
$$\Pi (X) = \prod_{i=1}^N {e^{- \beta_i E(x_i)} \over Z(T_i)}$$
with \\( \beta_i = {1 \over T_i }\\) and \\( Z(T_i) = \int e^{- \beta_i E(x_i)} dx_i\\).
- \\( \Pi \\) reduces to \\( f \\) if \\( T_i = 1 \\).
- At each iteration \\( t \\), construct a Markov Chain to
sample from \\( E(\cdot) \\) at \\(T_i\\) for each \\(i \\).
Putting into context
Problem set 5 example
$$\begin{align}
& \rr{log} p(\vec{Z}_i, B \given C_{ijk} ) = \rr{const} + \\
& \sum_{i=1}^I \left( \sum_{j \in J_i} \sum_{k \in K_{ij}}
\left[ \vec{x}_{jk}' B -
\rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right] \right)
\cdot \vec{Z}_i
\end{align}$$
Call parameters \\( \theta \\), look at
\\( \rr{log}p(\theta \given y) \\).
Define components for Parallel Tempering
- Energy function \\( E(\theta) = - \rr{log} p(\theta \given y) \\).
- Set energy levels \\( T = \{ T_1, T_2, \ldots, T_n \} \\)
- What is the optimal temperature ladder?
- Composite system
$$\Pi (\theta) \propto \prod_{i=1}^n \rr{exp}\left\{-{1\over T_i} E(\theta_i)\right\}$$
The algorithm
initialize N temperatures x_1, ..., x_N
initialize N replicas x_1, ..., x_N
repeat {
sample a new point x_i for each replica i
with probability theta {
sample an integer j from U[1, N-1]
attempt a swap between replicas j and j + 1
}
}
Intuition behind PT
PT performance
Acceptance probabilities
- Local moves
\\(w_{\rr{local}} (x_i \rightarrow x_i') = e^{-\beta_i \Delta_i E} = e^{-\beta_i (E(x_i') - E(x_i))}\\)
- Replica exchange across chains
$$\begin{align} &P_{\rr{replica}} (x_i \leftrightarrow x_{i+1}) = \\
& P( x_1, \ldots, x_{i+1}, x_i, \ldots, x_N \given \ldots, x_i, x_{i+1}, \ldots) \\
& = \rr{min} \left( 1, {\Pi(x_1, \ldots, x_{i+1}, x_i, \ldots, x_N) \over \Pi(x_1, \ldots, x_i, x_{i+1}, \ldots, x_N) }\right)\\
& = \rr{min} \left( 1, e^{(\beta_i - \beta_{i+1})(E(x_i) - E(x_{i+1}))}\right)\end{align}$$
How to make it parallel
- Run everything on one node (multicore vs no parallelization).
- Can tell worker nodes to run chains.
- Handling replica exchange communication (main thing):
- All through the master node.
- Allow maximum inter-worker communication, independent of the master.
- Pros and cons of each method?
Blocking vs non-blocking design
- Blocking design is when your parallel computing
system sits idle waiting for something to occur.
- Non-blocking design is when we minimize unnecessary
wait times so that the nodes do what they are
supposed to do (compute) and don't sit idle.
Non-blocking design
- Want to minimize inter-node communication
- Running identical random number streams on workers.
- Communicate directly with another worker and not through master.
PT random number streams
- For this, need to be able to initialize random number
streams at the same value and send them to workers.
Random number generation
- Reference: "Parallel Random Number Generation in C++ and R Using RngStream"
by Andrew Karl, Randy Eubank, and Dennis Young
- Use sequence:
$$\begin{align}x_{1,n} = &(1403580 \cdot x_{1,n−2}−810728\cdot x_{1,n−3}) \\
& \rr{mod } 4294967087 \\
x_{2,n} = & (527612 \cdot x_{2,n−1}−1370589 \cdot x_{2,n−3}) \\
& \rr{mod } 4294944443 \end{align}$$
Get uniform RV
The "uniform" random draw is \\( u_n \\):
$$\begin{align} z_n & = (x_{1,n} − x_{2,n} ) \rr{ mod } 4294967087 \\
u_n & = \begin{cases} z_n/4294967088 & \rr{if } z_n > 0 \\ 4294967087/4294967088 & \rr{if } z_n = 0 \end{cases}\end{align}$$
Parallel Tempering program structure
Given target density, the PT algorithm is almost generic
- The only component that's not generic
is the within-chain sampling algorithm.
- What MCMC algorithms could we use for the within-chain
sampling?
- This means we can have a generic implementation that
we can vary from problem to problem.
Announcements
- Code leaderboard winner for pset 3!
- Final project second checkpoint.
Final slide
- Next lecture: MPI for Parallel Tempering.