# Parallel Tempering

### Stat 221, Lecture 18

@snesterko

• Multimodality as a challenge for MCMC
• MCMC diagnostics
• Parallel Tempering algorithm
• General structure of a PT program

## Multimodality

• It is a very common feature in likelihood/posterior surfaces.
• What causes it?
• Non-identifiabilities.
• Overfitting.
• Multiple likely parameter configurations.
• To extract all information from the posterior surface, need to be able to effectively sample from it.

## Now more complex

• Latent class regression as in the problem set: \$$24 \times \rr{nclasses} \$$ regression parameters.
• Protein folding.
• Any complex system + imperfect modeling.

## Areas of open research

• Overfitting.
• Model selection.
• Questions:
• How to disambiguate natural versus modeler-imposed multimodality?
• What metrics should we use to know that a given model is sufficient?
• In a world with perfect modelers, would MLE estimation suffice?

## But for now

• Need to be able to sample from posterior surfaces to get an answer.

## Strategies

• "Running your MCMC long enough"
• MCMC
• Point estimation + curvature
• Simulated Tempering
• Parallel Tempering
• Equi-energy Sampler

## Why naive MCMC isn't good

• We don't know how long "long enough" is. In addition to visual inspection, there are several convergence diagnostics:
• Gelman-Rubin
• Geweke
• Raftery-Lewis (example below)
• Often, chains need to run prohibitively long to achieve acceptable mixing.
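
Geweke and Gelman-Rubin are demonstrated on the next slides. For completeness, a minimal sketch of Raftery-Lewis via the coda package, assuming mh.draws is an mcmc object as on the next slide:

library(coda)
# How many draws are needed to estimate the q-quantile
# to within +/- r with probability s?
raftery.diag(mh.draws, q = 0.025, r = 0.005, s = 0.95)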

## The Geweke diagnostic

• Run the MCMC chain for some time.
• Consider two non-overlapping parts of the chain.
• Run a difference in means test adjusting for autocorrelation.
> geweke.diag(mh.draws)
Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5
var1   var2
0.6149 0.1035
• The reported values are z-scores for the test of equal means; values far outside \$$\pm 2 \$$ suggest the chain has not converged.
• Gelman-Rubin diagnostic is slightly more involved.

## Gelman-Rubin diagnostic

• First, run multiple chains, say \$$m \$$ of them, each with \$$n \$$ draws.
• Compute within-chain variance estimator for parameter \$$\theta \$$ $$W = {1 \over m } \sum_{j=1}^m s_j^2$$ with $$s_j^2 = {1 \over n-1} \sum_{i=1}^n(\theta_{ij} - \bar{\theta}_{j})^2$$
• \$$W \$$ likely underestimates the true variance, since a finite chain has not yet explored the whole distribution.

## Across chains

• Compute between chain variance estimate $$B = { n \over m-1} \sum_{j=1}^m (\bar{\theta}_j - \bar{\bar{\theta}})^2$$ with $$\bar{\bar{\theta}} = {1 \over m} \sum_{j=1}^m \bar{\theta}_j$$
• This is the sample variance of the chain averages multiplied by \$$n \$$, the number of draws in each chain (why?).

## Variance of stationary distribution

• Based on \$$W \$$ and \$$B \$$, estimate variance of stationary distribution $$\hat{\rr{Var}} \theta = \left( 1 - {1 \over n}\right) W + {1 \over n} B$$
• Because of overdispersion in starting values, this is likely to overestimate the true variance.
• The estimator is unbiased if starting values come from the stationary distribution.

## Scale reduction factor

Compute $$\hat{R} = \sqrt{\hat{\rr{Var}} \theta \over W}$$

If \$$\hat{R} \$$ is substantially above 1 (a common rule of thumb is \$$\hat{R} > 1.1 \$$), we should run the chains longer to improve convergence to the stationary distribution. A by-hand sketch of the computation follows below.

• For this diagnostic, need to run multiple chains.
• Which diagnostic is better - Geweke or Gelman-Rubin?
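
Before reaching for coda, a minimal sketch of the computation by hand, assuming the chains are stored in an \$$n \times m \$$ matrix called draws (an illustrative name), one column per chain:

# draws: n x m matrix of posterior draws, one column per chain
n <- nrow(draws); m <- ncol(draws)
chain.means <- colMeans(draws)
W <- mean(apply(draws, 2, var))           # within-chain variance
B <- n * var(chain.means)                 # between-chain variance
var.hat <- (1 - 1 / n) * W + (1 / n) * B  # estimated variance of stationary distribution
R.hat <- sqrt(var.hat / W)                # scale reduction factor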

## Gelman-Rubin diagnostic code

library(coda)  # provides mcmc(), mcmc.list(), gelman.diag(), gelman.plot()

# Run five independent chains of the course's Metropolis-Hastings sampler,
# keeping all draws (no burn-in discarded).
mh.draws1 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws2 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws3 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws4 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws5 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.list <- mcmc.list(list(mh.draws1, mh.draws2, mh.draws3, mh.draws4,
                          mh.draws5))
gelman.diag(mh.list)

To visualize how \$$\hat{R} \$$ evolves with chain length, use

gelman.plot(mh.list)

## Starting with random points

• Very practical.
• But there is no guarantee that the chains explore the whole space.
• A good starting point for a more substantial algorithm.

## Simulated Tempering

• Data augmentation! Ideas borrowed from physics.
• Every density function can be rewritten in this way: $$f(x) = e^{-(-\rr{log}f(x))}$$
• Define energy \$$E(x) = -\rr{log}f(x)\$$
• Aim to simulate from $$f(x, T) = e^{- {1 \over T} E(x)}$$
• \$$T \in \{ T_1, T_2, \ldots, T_m\}\$$ - a pre-specified set of temperature levels, which should include temperature 1 so that draws from the original target are recovered (illustration below).
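
To see why tempering helps, a minimal R sketch (an illustration, not from the lecture) of how raising \$$T \$$ flattens a bimodal target, making it easier for a chain to cross between modes:

# Bimodal target and its energy
f <- function(x) 0.5 * dnorm(x, -3) + 0.5 * dnorm(x, 3)
E <- function(x) -log(f(x))

x <- seq(-8, 8, length.out = 400)
plot(x, exp(-E(x)), type = "l", ylab = "unnormalized density")  # T = 1: two sharp modes
lines(x, exp(-E(x) / 4), lty = 2)                               # T = 4: flatter
lines(x, exp(-E(x) / 16), lty = 3)                              # T = 16: nearly flat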

## Properties

• Sample from the joint temperature-parameter space.
• Only use the draws from the parameter space with temperature 1.
• Will we ever converge in the temperature space?
• Can this algorithm be parallelized?

## Parallel Tempering

• Aims to use the ideas of simulated tempering.
• Amenable to (designed for) parallelization.
• Idea: run chains at different temperatures in parallel, and "swap" parameter values between them once in a while.
• Goal: improve mixing at temperature 1.

## Definitions

Define composite system $$\Pi (X) = \prod_{i=1}^N {e^{- \beta_i E(x_i)} \over Z(T_i)}$$

with \$$\beta_i = {1 \over T_i }\$$ and \$$Z(T_i) = \int e^{- \beta_i E(x_i)} dx_i\$$.

• The factor of \$$\Pi \$$ with \$$T_i = 1 \$$ reduces to \$$f \$$, so that chain targets the original posterior.
• At each iteration \$$t \$$, advance a Markov chain targeting \$$e^{-\beta_i E(\cdot)} \$$ for each \$$i \$$.

## Putting into context

Problem set 5 example

\begin{align} & \rr{log} p(\vec{Z}_i, B \given C_{ijk} ) = \rr{const} + \\ & \sum_{i=1}^I \left( \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right] \right) \cdot \vec{Z}_i \end{align}

Call parameters \$$\theta \$$, look at \$$\rr{log}p(\theta \given y) \$$.

## Define components for Parallel Tempering

• Energy function \$$E(\theta) = - \rr{log} p(\theta \given y) \$$.
• Set temperature levels \$$T = \{ T_1, T_2, \ldots, T_n \} \$$ - the temperature ladder.
• What is the optimal temperature ladder? (A common heuristic is sketched below.)
• Composite system $$\Pi (\theta) \propto \prod_{i=1}^n \rr{exp}\left\{-{1\over T_i} E(\theta_i)\right\}$$
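
The optimal ladder is an open question; one common heuristic (an assumption here, not a lecture prescription) is geometric spacing between 1 and some \$$T_{\rr{max}} \$$:

# Geometrically spaced temperature ladder: 1, 2, 4, ..., 64
n.temps <- 7
t.max <- 64
temps <- t.max ^ seq(0, 1, length.out = n.temps)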

## The algorithm

initialize N temperatures T_1, ..., T_N
initialize N replicas x_1, ..., x_N
repeat {
    # local move: advance each replica one MCMC step at its own temperature
    sample a new point x_i for each replica i
    with probability p_swap {
        sample an integer j uniformly from {1, ..., N - 1}
        attempt a swap between replicas j and j + 1
    }
}

## Acceptance probabilities

• Local moves within chain \$$i \$$: \$$w_{\rr{local}} (x_i \rightarrow x_i') = \rr{min}\left(1, e^{-\beta_i \Delta_i E}\right) \$$ with \$$\Delta_i E = E(x_i') - E(x_i) \$$.
• Replica exchange across chains

\begin{align} &P_{\rr{replica}} (x_i \leftrightarrow x_{i+1}) = \\ & P( x_1, \ldots, x_{i+1}, x_i, \ldots, x_N \given \ldots, x_i, x_{i+1}, \ldots) \\ & = \rr{min} \left( 1, {\Pi(x_1, \ldots, x_{i+1}, x_i, \ldots, x_N) \over \Pi(x_1, \ldots, x_i, x_{i+1}, \ldots, x_N) }\right)\\ & = \rr{min} \left( 1, e^{(\beta_i - \beta_{i+1})(E(x_i) - E(x_{i+1}))}\right)\end{align}
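
Putting the pieces together, a minimal univariate, serial sketch in R, assuming a user-supplied energy function E (negative log target); the names and the random-walk proposal are illustrative choices, not the lecture's prescription:

parallel.tempering <- function(E, temps, x0, n.iter, prop.sd = 1, p.swap = 0.2) {
  N <- length(temps)
  beta <- 1 / temps
  x <- rep(x0, N)             # one replica per temperature
  draws <- numeric(n.iter)    # keep only the temperature-1 chain (temps[1] == 1)
  for (t in seq_len(n.iter)) {
    # local move: one random-walk Metropolis step per replica
    for (i in seq_len(N)) {
      x.prop <- x[i] + rnorm(1, sd = prop.sd)
      if (log(runif(1)) < -beta[i] * (E(x.prop) - E(x[i]))) x[i] <- x.prop
    }
    # replica exchange between a random adjacent pair
    if (runif(1) < p.swap) {
      j <- sample(N - 1, 1)
      log.acc <- (beta[j] - beta[j + 1]) * (E(x[j]) - E(x[j + 1]))
      if (log(runif(1)) < log.acc) x[c(j, j + 1)] <- x[c(j + 1, j)]
    }
    draws[t] <- x[1]
  }
  draws
}

# Example: a well-separated bimodal target.
E <- function(x) -log(0.5 * dnorm(x, -5) + 0.5 * dnorm(x, 5))
out <- parallel.tempering(E, temps = c(1, 2, 4, 8), x0 = 0, n.iter = 10000)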

## How to make it parallel

• Run everything on one node (multicore vs no parallelization).
• Can tell worker nodes to run chains.
• Handling replica exchange communication (main thing):
• All through the master node.
• Allow maximum inter-worker communication, independent of the master.
• Pros and cons of each method?

## Blocking vs non-blocking design

• In a blocking design, parts of the parallel computing system sit idle waiting for some event to occur.
• In a non-blocking design, we minimize unnecessary wait times so that the nodes do what they are supposed to do (compute) and don't sit idle.

## Non-blocking design

• Want to minimize inter-node communication.
• Run identical random number streams on the workers, so they agree on swap decisions without extra messages.
• Let workers communicate directly with each other rather than through the master.

## PT random number streams

• For this, we need to be able to initialize random number streams at the same value and send them to the workers (a sketch follows).
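
A minimal sketch with base R's parallel package (an illustration; the reference below uses RngStream in C++ and R): give every worker the same L'Ecuyer stream, so their uniform draws, and hence their swap decisions, coincide.

library(parallel)
cl <- makeCluster(4)
# Same generator and seed on every worker: identical streams.
clusterEvalQ(cl, { RNGkind("L'Ecuyer-CMRG"); set.seed(221) })
clusterEvalQ(cl, runif(3))  # all four workers return the same three numbers
stopCluster(cl)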

## Random number generation

• Reference: "Parallel Random Number Generation in C++ and R Using RngStream" by Andrew Karl, Randy Eubank, and Dennis Young
• It uses the following pair of recurrences (L'Ecuyer's MRG32k3a generator):

\begin{align}x_{1,n} = &(1403580 \cdot x_{1,n-2} - 810728 \cdot x_{1,n-3}) \\ & \rr{mod } 4294967087 \\ x_{2,n} = & (527612 \cdot x_{2,n-1} - 1370589 \cdot x_{2,n-3}) \\ & \rr{mod } 4294944443 \end{align}

## Get uniform RV

The "uniform" random draw is \$$u_n \$$:

\begin{align} z_n & = (x_{1,n} - x_{2,n} ) \rr{ mod } 4294967087 \\ u_n & = \begin{cases} z_n/4294967088 & \rr{if } z_n > 0 \\ 4294967087/4294967088 & \rr{if } z_n = 0 \end{cases}\end{align}
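
As a sanity check on the recurrences, a minimal R sketch of one generator step (names are illustrative; R's built-in "L'Ecuyer-CMRG" RNGkind is based on this same generator):

m1 <- 4294967087  # modulus of the first component
m2 <- 4294944443  # modulus of the second component

# state$x1 = c(x_{1,n-3}, x_{1,n-2}, x_{1,n-1}), and likewise state$x2
mrg32k3a.step <- function(state) {
  x1n <- (1403580 * state$x1[2] - 810728 * state$x1[1]) %% m1
  x2n <- (527612 * state$x2[3] - 1370589 * state$x2[1]) %% m2
  z <- (x1n - x2n) %% m1
  u <- if (z > 0) z / (m1 + 1) else m1 / (m1 + 1)
  list(u = u,
       state = list(x1 = c(state$x1[-1], x1n), x2 = c(state$x2[-1], x2n)))
}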

## Given target density, the PT algorithm is almost generic

• The only component that is not generic is the within-chain sampling algorithm.
• What MCMC algorithms could we use for the within-chain sampling?
• This means we can write one generic implementation and only swap out the within-chain sampler from problem to problem.

## Announcements

• Code leaderboard winner for pset 3!
• Final project second checkpoint.

## Final slide

• Next lecture: MPI for Parallel Tempering.