- Multimodality as a challenge for MCMC
- MCMC diagnostics
- Parallel Tempering algorithm
- General structure of a PT program

- It is a very common feature in likelihood/posterior surfaces.
- What causes it?
- Non-identifiabilities.
- Overfitting.
- Multiple likely parameter configurations.

- To extract all information from the posterior surface, need to be able to effectively sample from it.

- Latent class regression as in the problem set: \\( 24 \times \rr{nclasses} \\) regression parameters.
- Protein folding.
- Any complex system + imperfect modeling.

- Overfitting.
- Model selection.
- Questions:
- How to disambiguate natural versus modeler-imposed multimodality?
- What metrics should we use to know that a given model is sufficient?
- In a world with perfect modelers, would MLE estimation suffice?

- Need to be able to sample from posterior surfaces to get an answer.

- "Running your MCMC long enough"
- Start with random points, and:
- MCMC
- Point estimation + curvature

- Simulated Tempering
- Parallel Tempering
- Equi-energy Sampler

- We don't know how long is "long enough". In addition to visual inspection,
there are several convergence diagnostics:
- Gelman-Rubin
- Geweke
- Raftery-Lewis

- Often, need to run prohibitively long to achieve acceptable mixing.

- Run the MCMC chain for some time.
- Consider two non-overlapping parts of the chain.
- Run a difference in means test adjusting for
autocorrelation.
> geweke.diag(mh.draws) Fraction in 1st window = 0.1 Fraction in 2nd window = 0.5 var1 var2 0.6149 0.1035

- Gelman-Rubin diagnostic is slightly more involved.

- First, need to run multiple chains, for example, \\( m \\) chains.
- Compute within-chain variance estimator for parameter \\( \theta \\) $$W = {1 \over m } \sum_{j=1}^m s_j^2$$ with $$s_j^2 = {1 \over n-1} \sum_{i=1}^n(\theta_{ij} - \bar{\theta}_{j})^2$$
- Likely, \\( W \\) underestimates the true variance.

- Compute between chain variance estimate $$B = { n \over m-1} \sum_{j=1}^m (\bar{\theta}_j - \bar{\bar{\theta}})^2$$ with $$\bar{\bar{\theta}} = {1 \over m} \sum_{j=1}^m \bar{\theta}_j$$
- This is variance of each chain average multimplied by \\( n \\) - the number of draws in each chain (why?)

- Based on \\( W \\) and \\( B \\), estimate variance of stationary distribution $$\hat{\rr{Var}} \theta = \left( 1 - {1 \over n}\right) W + {1 \over n} B$$
- Because of overdispersion in starting values, this is likely to overestimate the true variance.
- The estimator is unbiased if starting values come from the stationary distribution.

Compute $$\hat{R} = \sqrt{\hat{\rr{Var}} \theta \over W}$$

If \\( \hat{R} \\) is high, we should run chains longer to improve convergence to the stationary distribution.

- For this diagnostic, need to run multiple chains.
- Which diagnostic is better - Geweke or Gelman-Rubin?

```
mh.draws1 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws2 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws3 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws4 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.draws5 <- mcmc(NormalMHExample(n.sim = 5000, n.burnin = 0))
mh.list <- mcmc.list(list(mh.draws1, mh.draws2, mh.draws3, mh.draws4,
+ mh.draws5))
gelman.diag(mh.list)
```

Use

`gelman.plot(mh.list)`

- Very practical.
- But there isn't much guarantee that we explore the whole space.
- A good starting point for a more substantial algorithm.

- Data augmentation! Ideas borrowed from physics.
- Every density function can be rewritten in this way: $$f(x) = e^{-(-\rr{log}f(x))}$$
- Define energy \\( E(x) = -\rr{log}f(x)\\)
- Aim to simulate from $$f(x, T) = e^{- {1 \over T} E(x)}$$
- \\( T \in \{ T_1, T_2, \ldots, T_m\}\\) - pre-defined set of temperature levels (presumably including temperature 1).

- Sample from the joint temperature-parameter space.
- Only use the draws from the parameter space with temperature 1.
- Will we ever converge in the temperature space?
- Can this algorithm be parallelized?

- Aims to use the ideas of simulated tempering.
- Amenable to (designed for) parallelization.
- Idea: run chains at different temperatures in parallel, and "swap" parameter values between them once in a while.
- Goal: improve mixing at temperature 1.

Define composite system $$\Pi (X) = \prod_{i=1}^N {e^{- \beta_i E(x_i)} \over Z(T_i)}$$

with \\( \beta_i = {1 \over T_i }\\) and \\( Z(T_i) = \int e^{- \beta_i E(x_i)} dx_i\\).

- \\( \Pi \\) reduces to \\( f \\) if \\( T_i = 1 \\).
- At each iteration \\( t \\), construct a Markov Chain to sample from \\( E(\cdot) \\) at \\(T_i\\) for each \\(i \\).

Problem set 5 example

$$\begin{align} & \rr{log} p(\vec{Z}_i, B \given C_{ijk} ) = \rr{const} + \\ & \sum_{i=1}^I \left( \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right] \right) \cdot \vec{Z}_i \end{align}$$

Call parameters \\( \theta \\), look at \\( \rr{log}p(\theta \given y) \\).

- Energy function \\( E(\theta) = - \rr{log} p(\theta \given y) \\).
- Set energy levels \\( T = \{ T_1, T_2, \ldots, T_n \} \\)
- What is the optimal temperature ladder?

- Composite system $$\Pi (\theta) \propto \prod_{i=1}^n \rr{exp}\left\{-{1\over T_i} E(\theta_i)\right\}$$

initialize N temperatures x_1, ..., x_N initialize N replicas x_1, ..., x_N repeat { sample a new point x_i for each replica i with probability theta { sample an integer j from U[1, N-1] attempt a swap between replicas j and j + 1 } }

- Local moves \\(w_{\rr{local}} (x_i \rightarrow x_i') = e^{-\beta_i \Delta_i E} = e^{-\beta_i (E(x_i') - E(x_i))}\\)
- Replica exchange across chains

$$\begin{align} &P_{\rr{replica}} (x_i \leftrightarrow x_{i+1}) = \\ & P( x_1, \ldots, x_{i+1}, x_i, \ldots, x_N \given \ldots, x_i, x_{i+1}, \ldots) \\ & = \rr{min} \left( 1, {\Pi(x_1, \ldots, x_{i+1}, x_i, \ldots, x_N) \over \Pi(x_1, \ldots, x_i, x_{i+1}, \ldots, x_N) }\right)\\ & = \rr{min} \left( 1, e^{(\beta_i - \beta_{i+1})(E(x_i) - E(x_{i+1}))}\right)\end{align}$$

- Run everything on one node (multicore vs no parallelization).
- Can tell worker nodes to run chains.
- Handling replica exchange communication (main thing):
- All through the master node.
- Allow maximum inter-worker communication, independent of the master.

- Pros and cons of each method?

- Blocking design is when your parallel computing system sits idle waiting for something to occur.
- Non-blocking design is when we minimize unnecessary wait times so that the nodes do what they are supposed to do (compute) and don't sit idle.

- Want to minimize inter-node communication
- Running identical random number streams on workers.
- Communicate directly with another worker and not through master.

- For this, need to be able to initialize random number streams at the same value and send them to workers.

- Reference: "Parallel Random Number Generation in C++ and R Using RngStream" by Andrew Karl, Randy Eubank, and Dennis Young
- Use sequence:

$$\begin{align}x_{1,n} = &(1403580 \cdot x_{1,n−2}−810728\cdot x_{1,n−3}) \\ & \rr{mod } 4294967087 \\ x_{2,n} = & (527612 \cdot x_{2,n−1}−1370589 \cdot x_{2,n−3}) \\ & \rr{mod } 4294944443 \end{align}$$

The "uniform" random draw is \\( u_n \\):

$$\begin{align} z_n & = (x_{1,n} − x_{2,n} ) \rr{ mod } 4294967087 \\ u_n & = \begin{cases} z_n/4294967088 & \rr{if } z_n > 0 \\ 4294967087/4294967088 & \rr{if } z_n = 0 \end{cases}\end{align}$$

- The only component that's not generic
is the within-chain sampling algorithm.
- What MCMC algorithms could we use for the within-chain sampling?

- This means we can have a generic implementation that we can vary from problem to problem.

- Code leaderboard winner for pset 3!
- Final project second checkpoint.

- Slides nesterko.com/lectures/stat221-2012/lecture18
- Class website theory.info/harvardstat221
- Class Piazza piazza.com/class#spring2013/stat221
- Class Twitter twitter.com/harvardstat221

- Next lecture: MPI for Parallel Tempering.