# Missing data and MCMC, HMC

@snesterko

### Where we are

• Implementation immediately feeds back into design!

## MCMC and missing data

• EM integrates over the missing data. \begin{align}Q\left( \theta \given \theta^{(t)} \right) = \int & l(\theta \given Y_{\rr{obs}}, Y_{\rr{mis}}) \\ & \cdot f\left( Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta = \theta^{(t)}\right) dY_{\rr{mis}} \end{align}
• MCMC can help do the same.
• Draw missing data as if they were parameters.
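
For a concrete picture (a toy sketch of my own, not from the lecture): a Gibbs sampler for a bivariate normal mean with some second coordinates missing simply redraws the missing values each iteration, alongside the parameter draws.

```python
# Data augmentation Gibbs sketch: the missing y2's are drawn inside the
# sampler exactly as if they were extra parameters.
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 200
Sigma = np.array([[1.0, rho], [rho, 1.0]])
y = rng.multivariate_normal([1.0, -1.0], Sigma, size=n)
miss = rng.random(n) < 0.3            # ~30% of the y2's are missing (MCAR)
y[miss, 1] = np.nan                   # placeholders; overwritten below

mu = np.zeros(2)                      # flat prior on mu, covariance Sigma known
draws = []
for _ in range(2000):
    # 1. Draw the missing y2's from y2 | y1, mu (conditional normal)
    cond_mean = mu[1] + rho * (y[miss, 0] - mu[0])
    y[miss, 1] = rng.normal(cond_mean, np.sqrt(1.0 - rho ** 2))
    # 2. Draw mu given the completed data: mu | y ~ N(ybar, Sigma / n)
    mu = rng.multivariate_normal(y.mean(axis=0), Sigma / n)
    draws.append(mu)
draws = np.array(draws)               # posterior draws of (mu1, mu2)
```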

## Maximizing over missing data?

• Taking something the model says is random and treating it as an unknown constant.
• Can lead to problems if the number of parameters grows with the number of data points (Neyman-Scott problem).
• MLE efficiency conditions break down.

## Example: radon measurements

• Radon level of house \$$i \$$ in the neighborhood is \$$\mu_i \sim \rr{N}\left( \mu_0, \sigma_0^2\right)\$$.
• \$$\mu_0 \$$ is the mean neighborhood radon level, \$$\sigma_0 \$$ is the spread of individual house radon levels around the neighborhood mean.
• Conditional on the true house radon level, the detector measures it with error: $$y_i \given \mu_i \sim \rr{N} \left( \mu_i, \sigma^2\right)$$

## Neyman-Scott problem

• There is one \$$\mu_i \$$ for each \$$y_i \$$.
• This leads to an inconsistent estimate of \$$\sigma_0^2 \$$.
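
To see why (a sketch I am adding, with \$$\sigma^2 \$$ treated as known): maximize over the \$$\mu_i \$$ first.

\begin{align}\hat\mu_i = \frac{\sigma_0^2 y_i + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2}, \quad l_{\rr{prof}}\left(\mu_0, \sigma_0^2\right) = & -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{n}{2}\log\left(2\pi\sigma_0^2\right) \\ & - \sum_{i=1}^n \frac{(y_i - \mu_0)^2}{2\left(\sigma^2 + \sigma_0^2\right)}\end{align}

The \$$-\frac{n}{2}\log\left(2\pi\sigma_0^2\right) \$$ term diverges as \$$\sigma_0^2 \to 0 \$$, so joint maximization drives \$$\hat\sigma_0^2 \$$ to zero no matter what the data say; integrating the \$$\mu_i \$$ out (EM, MCMC) avoids this.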

## EM as a half-Bayesian method

• What part of EM looks Bayesian?

## mcEM

• Change the E step.
• Replace \$$E (Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta) \$$ (more generally, the expectation in the E step) by a Monte Carlo estimate (sketch after this list).
• Obtain the estimate by simulating from the distribution $$f(Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta)$$
• What are the possible caveats?
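
A minimal mcEM sketch (my own toy example, not from the lecture): right-censored normal data, where the E step is replaced by simulating the censored values from a truncated normal and the M step maximizes the Monte Carlo estimate of \$$Q \$$.

```python
# mcEM for y ~ N(mu, sigma^2) right-censored at c: we observe min(y, c) and a flag.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
n, c = 500, 3.0
y_full = rng.normal(2.0, 1.0, n)       # true values (mu = 2, sigma = 1)
cens = y_full >= c
y_obs = np.where(cens, c, y_full)      # what we actually get to see

mu, sigma = 0.0, 2.0                   # starting values
m = 200                                # Monte Carlo draws per E step
for _ in range(50):
    # E step (Monte Carlo): draw the censored y's from N(mu, sigma^2)
    # truncated to [c, infinity), i.e. from f(Y_mis | Y_obs, theta)
    a = (c - mu) / sigma
    draws = truncnorm.rvs(a, np.inf, loc=mu, scale=sigma,
                          size=(m, cens.sum()), random_state=rng)
    y_comp = np.tile(y_obs, (m, 1))    # m completed data sets
    y_comp[:, cens] = draws
    # M step: maximize the MC estimate of Q -> pooled mean and variance
    mu = y_comp.mean()
    sigma = np.sqrt(((y_comp - mu) ** 2).mean())

print(mu, sigma)                       # should land near (2, 1)
```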

## When to apply mcEM

• Quick and dirty approximate method.
• Hard to explicitly find the expected value in the E step.

## Example: document authorship

• Document authorship \$$\pi_d \sim \rr{Dirichlet}_K(\vec{\alpha},1) \$$.
• Data \$$y_{dw} \given \pi_d \sim \rr{Poisson}(\sum_k\mu_{kw}\pi_{dk})\$$.
• Author \$$k\$$, document \$$d \$$, word \$$w \$$.
• \$$\pi_{dk} \$$ represents the contribution of author \$$k \$$ to document \$$d\$$.

## Likelihood

For document \$$d \$$:

\begin{align}L(\alpha, \mu \given y, \pi) \propto & \prod_{w=1}^W\left[\left(\sum_k\mu_{kw}\pi_{dk}\right)^{y_{dw}} \right. \\ & \left.\cdot e^{-\left(\sum_k\mu_{kw}\pi_{dk}\right)}\right] \cdot \prod_{k=1}^K \pi_{dk}^{\alpha_k - 1}\end{align}

The expectation defining the \$$Q \$$ function is hard to get in closed form.
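
Spelling it out (my own expansion, consistent with the likelihood above), the E step would need

\begin{align}Q\left(\alpha, \mu \given \alpha^{(t)}, \mu^{(t)}\right) = \rr{E}\Bigg[ \sum_{w=1}^W & \left( y_{dw}\log\sum_k\mu_{kw}\pi_{dk} - \sum_k\mu_{kw}\pi_{dk}\right) \\ & + \sum_{k=1}^K \left(\alpha_k - 1\right)\log\pi_{dk} \;\Bigg|\; y, \alpha^{(t)}, \mu^{(t)} \Bigg]\end{align}

and the expectation of \$$\log\sum_k\mu_{kw}\pi_{dk} \$$ over the posterior of \$$\pi_d \$$ has no closed form, which motivates the simulation-based E step of mcEM.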

## Caveats of mcEM

• The expected values associated with the latent variables are only estimated, adding Monte Carlo error.
• Especially challenging for high-dimensional distributions; requires many MC iterations in the E step.

## Full MCMC and EM connection

• Point estimation versus full posterior inference.
• Full MCMC erases the difference between a parameter and a latent variable.
• MCMC "contains" point estimation, EM can recreate the results of MCMC only under Normality (practically, never).

## Discrete vs continuous rvs

Discrete rvs are special:

• Absence of the Jacobian when changing variables (see below).
• Hard to construct symmetric jumping rules for MCMC in the discrete case.
• Discrete rvs are hard to deal with.
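
As a reminder (a standard fact, not specific to these slides): for \$$Y = g(X) \$$ with \$$g \$$ monotone,

$$p_Y(y) = p_X\left(g^{-1}(y)\right)\left|\frac{d}{dy}g^{-1}(y)\right| \qquad \rr{vs.} \qquad \rr{P}(Y = y) = \rr{P}\left(X = g^{-1}(y)\right)$$

so there is no Jacobian correction in the discrete case, and continuous-variable tricks such as sampling on a transformed scale do not carry over directly.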

## Example: HMMs

Drawing \$$h \$$ is hard.

• Baum-Welch and full Bayes don't work well.

## Posterior for HMMs

\begin{align}p(\theta, H \given Y ) \propto & f_{\theta}(y_1 \given h_1)g_\theta(h_1) \cdot \\ & \prod_{i=2}^n f_{\theta}(y_i \given h_i)g_\theta(h_i \given h_{i-1})\end{align}

• Even for a 2-state HMM, once the sample size is non-trivial (say \$$n > 10 \$$), it becomes hard to find high-probability chains \$$H \$$ via MCMC: there are \$$2^n \$$ possible chains (already \$$2^{50} \approx 10^{15} \$$ for \$$n = 50 \$$).

## Hamiltonian Monte Carlo

• MCMC to sample from distributions of continuous rvs.
• Uses data augmentation strategy (hacker level!).
• Leverages physics to get draws far away from the initial values.
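
In symbols (a standard formulation, with the mass matrix \$$M \$$ my own notation; next lecture fills in the details): augment \$$\theta \$$ with a momentum \$$p \sim \rr{N}(0, M) \$$ and numerically simulate the dynamics of

\begin{align}H(\theta, p) = -\log p(\theta \given Y) + \frac{1}{2} p^{\top} M^{-1} p, \qquad & \frac{d\theta}{dt} = M^{-1} p, \\ & \frac{dp}{dt} = \nabla_{\theta} \log p(\theta \given Y),\end{align}

then accept or reject the endpoint of the trajectory with a Metropolis step based on the change in \$$H \$$.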

## Example: Cauchy model

$$y_i \given \theta \mathop{\sim}^{\rr{ind}} \rr{Cauchy}(\theta, 1), \qquad p(y_i \given \theta) \propto { 1 \over (y_i - \theta)^2 + 1 }$$

Observe data \$$y_1 = 4, \rr{ } y_2 = -4 \$$. Posterior:

$$p(\theta \given Y) \propto { 1 \over (4 - \theta)^2 + 1 } \cdot { 1 \over (4 + \theta)^2 + 1 }$$

• Bimodal distribution.
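
A bare-bones HMC sketch for this posterior (my own illustration; the step size, path length, and starting point are arbitrary choices):

```python
# HMC with a hand-rolled leapfrog integrator, targeting
# p(theta | Y) proportional to 1/((4 - theta)^2 + 1) * 1/((4 + theta)^2 + 1).
import numpy as np

def log_post(theta):
    return -np.log((4 - theta) ** 2 + 1) - np.log((4 + theta) ** 2 + 1)

def grad_log_post(theta):
    return (2 * (4 - theta) / ((4 - theta) ** 2 + 1)
            - 2 * (4 + theta) / ((4 + theta) ** 2 + 1))

def hmc(n_iter=5000, eps=0.2, L=20, theta0=0.0, seed=1):
    rng = np.random.default_rng(seed)
    theta, draws = theta0, []
    for _ in range(n_iter):
        p = rng.normal()                          # fresh momentum (the augmentation)
        th, pm = theta, p
        pm += 0.5 * eps * grad_log_post(th)       # leapfrog: half step for momentum
        for _ in range(L - 1):
            th += eps * pm
            pm += eps * grad_log_post(th)
        th += eps * pm
        pm += 0.5 * eps * grad_log_post(th)
        # Metropolis accept/reject on the change in the Hamiltonian
        log_acc = (log_post(th) - 0.5 * pm ** 2) - (log_post(theta) - 0.5 * p ** 2)
        if np.log(rng.uniform()) < log_acc:
            theta = th
        draws.append(theta)
    return np.array(draws)

draws = hmc()   # the draws should visit both modes (near +/- 4)
```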

## Summary of HMC

• Simulates Hamiltonian dynamics.
• Moves around the space faster.
• "Strictly better" than classic Metropolis for continuous rvs (converges faster, smaller autocorrelation).
• Like classic Metropolis, gets trapped in local modes.

## Questions inspired by HMC

• Is there an equivalent for discrete rvs?
• How do we deal with separated local modes?

## Announcements

• T-shirt comp!
• Use Piazza, share ideas about homework.
• Pset4 due next Wednesday.
• Final project: method design meetings, end-of-semester presentation.

## Final slide

• Next lecture: HMC nuts and bolts