Missing data and MCMC, HMC

Stat 221, Lecture 14

@snesterko

Where we are

  • Implementation immediately feeds back into design!

MCMC and missing data

  • EM integrates over the missing data: $$Q\left( \theta \given \theta^{(t)} \right) = \int l(\theta \given Y_{\rr{obs}}, Y_{\rr{mis}}) \, f\left( Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta = \theta^{(t)}\right) \, dY_{\rr{mis}}$$
  • MCMC can help do the same.
    • Draw the missing data as if they were parameters (a minimal Gibbs-style sketch follows below).
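To make that bullet concrete, here is a minimal sketch (not from the lecture; the toy model and all values are made up) of a Gibbs-style sampler for \\( y_i \sim \rr{N}(\theta, 1) \\) with a few observations missing completely at random and a flat prior on \\( \theta \\): the missing entries are drawn exactly as if they were parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i ~ N(theta, 1), flat prior on theta,
# a few observations missing completely at random.
y_obs = np.array([1.2, 0.4, 2.1, 1.7, 0.9])
n_mis = 3
n = len(y_obs) + n_mis

theta = 0.0
draws = []
for it in range(5000):
    # Draw the missing data as if they were parameters.
    y_mis = rng.normal(theta, 1.0, size=n_mis)
    # Draw theta given the completed data: theta | y ~ N(mean(y), 1/n).
    y_full = np.concatenate([y_obs, y_mis])
    theta = rng.normal(y_full.mean(), np.sqrt(1.0 / n))
    draws.append(theta)

# Marginally, theta still targets its observed-data posterior
# N(mean(y_obs), 1 / n_obs); the y_mis draws are averaged over.
print(np.mean(draws[1000:]), y_obs.mean())
```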

Maximizing over missing data?

  • Taking something the model says is random and treating it as an unknown constant.
  • Can lead to problems if the number of parameters grows with the number of data points (Neyman-Scott problem).
  • The regularity conditions for MLE consistency and efficiency break down.

Gelman radon detectors

  • Radon level of house \\( i \\) in the neighborhood is \\( \mu_i \sim \rr{N}\left( \mu_0, \sigma_0^2\right)\\).
  • \\( \mu_0 \\) is the mean neighborhood radon level, \\( \sigma_0 \\) is the spread of individual house radon levels around the neighborhood mean.
  • Conditional on the true house radon level, the detector measures it with error (a simulation sketch follows this list): $$y_i \given \mu_i \sim \rr{N} \left( \mu_i, \sigma^2\right)$$
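A hedged sketch of this model (the hyperparameter values and sample size are made up for illustration), including the conditional draw of the latent \\( \mu_i \\) that an MCMC scheme would use:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up hyperparameters and sample size, for illustration only.
mu0, sigma0 = 1.0, 0.5    # neighborhood mean and spread of true levels
sigma = 0.3               # detector measurement error
n = 100

# Simulate true house radon levels and noisy detector readings.
mu = rng.normal(mu0, sigma0, size=n)
y = rng.normal(mu, sigma)

# Drawing the latent mu_i "as if they were parameters":
# mu_i | y_i, mu0, sigma0, sigma ~ N(m_i, v), precision-weighted.
v = 1.0 / (1.0 / sigma**2 + 1.0 / sigma0**2)
m = v * (y / sigma**2 + mu0 / sigma0**2)
mu_draw = rng.normal(m, np.sqrt(v))
```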

Neyman-Scott problem

  • There is one latent \\( \mu_i \\) for each observation \\( y_i \\), so the number of unknowns grows with \\( n \\).
  • Maximizing over the \\( \mu_i \\) leads to an inconsistent estimate of \\( \sigma_0^2 \\) (the classical version is simulated below).
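For a quick numerical check, the sketch below uses the classical Neyman-Scott setup (two replicates per latent mean, rather than the radon model itself; all values are made up): the joint MLE of the within-group variance converges to half its true value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Classical Neyman-Scott setup: y_ij ~ N(mu_i, sigma^2), j = 1, 2,
# with a separate latent mean mu_i for every pair.
sigma2_true = 1.0
n = 100_000
mu = rng.normal(0.0, 1.0, size=n)
y = rng.normal(mu[:, None], np.sqrt(sigma2_true), size=(n, 2))

# Joint MLE: mu_i_hat is the pair mean, then sigma2_hat is the
# mean squared deviation from the pair means.
mu_hat = y.mean(axis=1)
sigma2_hat = ((y - mu_hat[:, None]) ** 2).mean()

print(sigma2_hat)  # converges to sigma2_true / 2 = 0.5, not 1.0
```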

EM as a half-Bayesian method

  • What part of EM looks Bayesian?

mcEM

  • Change the E step
  • Replace \\( E (Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta) \\) by a Monte Carlo estimate
  • Obtain the estimate by simulating from the distribution $$ f(Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta) $$ (one iteration is sketched after this list)
  • What are the possible caveats?
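A minimal sketch of the idea (a toy version of the radon model with \\( \sigma \\) and \\( \sigma_0 \\) treated as known and \\( \mu_0 \\) the only parameter; all values are made up): each iteration replaces the E step by Monte Carlo draws of the latent \\( \mu_i \\), then maximizes the resulting estimate of \\( Q \\).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model (sigma, sigma0 known): mu_i ~ N(mu0, sigma0^2),
# y_i | mu_i ~ N(mu_i, sigma^2); the mu_i play the role of Y_mis.
sigma, sigma0 = 0.3, 0.5
y = rng.normal(1.0, np.sqrt(sigma**2 + sigma0**2), size=200)  # fake data

mu0 = 0.0                           # starting value for the parameter
v = 1.0 / (1.0 / sigma**2 + 1.0 / sigma0**2)
for t in range(50):
    # Monte Carlo E step: draw the latent mu_i given y and current mu0.
    m = v * (y / sigma**2 + mu0 / sigma0**2)
    mu_draws = rng.normal(m, np.sqrt(v), size=(500, len(y)))
    # M step: maximizing the MC estimate of Q over mu0 (quadratic in mu0)
    # gives the average of the drawn mu_i.
    mu0 = mu_draws.mean()

print(mu0, y.mean())   # mcEM approaches the marginal MLE, mean(y)
```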

When to apply mcEM

  • A quick-and-dirty approximate method.
  • Use it when the expectation in the E step is hard to find explicitly.

Example: document authorship

  • Document authorship \\( \pi_d \sim \rr{Dirichlet}_K(\vec{\alpha},1) \\).
  • Data \\( y_{dw} \given \pi_d \sim \rr{Poisson}(\sum_k\mu_{kw}\pi_{dk})\\).
  • Author \\( k\\), document \\( d \\), word \\( w \\).
  • \\( \pi_{dk} \\) represents the contribution of author \\( k \\) to document \\( d \\) (a simulation sketch follows below).
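A short sketch simulating data from this model (the dimensions, \\( \vec\alpha \\), and the word rates \\( \mu_{kw} \\) are made-up illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up dimensions: K authors, D documents, W words.
K, D, W = 3, 10, 50
alpha = np.ones(K)                     # Dirichlet parameter (assumed)
mu = rng.gamma(2.0, 1.0, size=(K, W))  # per-author word rates (made up)

# pi_d ~ Dirichlet_K(alpha): authorship shares per document.
pi = rng.dirichlet(alpha, size=D)      # shape (D, K)

# y_dw | pi_d ~ Poisson(sum_k mu_kw * pi_dk): word counts.
rates = pi @ mu                        # shape (D, W)
y = rng.poisson(rates)
```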

Likelihood

For document \\( d \\):

$$\begin{align}L(\alpha, \mu \given y, \pi) \propto & \prod_{w=1}^W\left[\left(\sum_k\mu_{kw}\pi_{dk}\right)^{y_{dw}} \right. \\ & \left.\cdot e^{-\left(\sum_k\mu_{kw}\pi_{dk}\right)}\right] \cdot \prod_{k=1}^K \pi_{dk}^{\alpha_k - 1}\end{align}$$

The expectation defining the \\( Q \\) function is hard to get in closed form.

Caveats of mcEM

  • The expected values associated with the latent variables are only estimated, so Monte Carlo error enters every iteration.
  • This is especially challenging for high-dimensional latent distributions, which require many MC draws in each E step.

Full MCMC and EM connection

  • Point estimation versus full posterior inference.
  • Full MCMC erases the difference between a parameter and a latent variable.
  • MCMC "contains" point estimation, EM can recreate the results of MCMC only under Normality (practically, never).

Non-normality

Discrete vs continuous rvs

Discrete rvs are special:

  • Absence of the Jacobian when changing variables.
  • Hard to construct symmetric jumping rules for MCMC in the discrete case.
  • Discrete rvs are hard to deal with.

Example: HMMs

Drawing \\( h \\) is hard.

  • Baum-Welch and full Bayes don't work well.

Posterior for HMMs

$$\begin{align}p(\theta, H \given Y ) \propto & f_{\theta}(y_1 \given h_1)g_\theta(h_1) \cdot \\ & \prod_{i=2}^n f_{\theta}(y_i \given h_i)g_\theta(h_i \given h_{i-1})\end{align}$$

  • Even for a 2-state HMM with a non-trivial sample size (say \\( n > 10 \\)), it becomes hard to find high-probability chains \\( H \\) via MCMC: there are \\( 2^n \\) possible chains (the sketch below enumerates them for a tiny example).
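To see the blow-up concretely, the sketch below enumerates all \\( 2^n \\) hidden chains for a tiny 2-state Gaussian HMM (all parameter values are made up); this brute force is only feasible for very small \\( n \\).

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)

# Toy 2-state Gaussian HMM with made-up parameters.
n = 10
trans = np.array([[0.9, 0.1], [0.2, 0.8]])   # g_theta(h_i | h_{i-1})
init = np.array([0.5, 0.5])                  # g_theta(h_1)
means = np.array([-1.0, 1.0])                # f_theta(y_i | h_i) = N(means[h_i], 1)

# Fake observed sequence.
y = rng.normal(0.0, 1.5, size=n)

def log_post(h):
    """Unnormalized log p(H | Y, theta) for one hidden chain h."""
    lp = np.log(init[h[0]])
    lp += np.sum(np.log(trans[h[:-1], h[1:]]))
    lp += np.sum(-0.5 * (y - means[h]) ** 2)   # Gaussian emissions, unit variance
    return lp

# Enumerating all 2^n chains: feasible only for tiny n.
chains = np.array(list(itertools.product([0, 1], repeat=n)))
scores = np.array([log_post(h) for h in chains])
print(len(chains), chains[scores.argmax()])    # 1024 chains; the MAP chain
```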

Hamiltonian Monte Carlo

  • An MCMC method for sampling from distributions of continuous rvs.
  • Uses a data augmentation strategy (hacker level!): auxiliary momentum variables.
  • Leverages physics to propose draws far away from the current value (one transition is sketched below).
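A hedged sketch of a single HMC transition (the helper name `one_hmc_step`, the standard normal target, the step size, and the number of leapfrog steps are illustrative choices, not the lecture's settings): draw an auxiliary momentum, run the leapfrog integrator, then accept or reject with a Metropolis step.

```python
import numpy as np

rng = np.random.default_rng(6)

def one_hmc_step(theta, log_p, grad_log_p, eps=0.1, L=20):
    """One HMC transition for a 1-d target: augment with momentum,
    simulate Hamiltonian dynamics with the leapfrog integrator,
    then accept/reject with a Metropolis step."""
    r = rng.normal()                          # auxiliary momentum (data augmentation)
    theta_new, r_new = theta, r
    # Leapfrog: L position steps of size eps, with half momentum steps at the ends.
    r_new += 0.5 * eps * grad_log_p(theta_new)
    for _ in range(L - 1):
        theta_new += eps * r_new
        r_new += eps * grad_log_p(theta_new)
    theta_new += eps * r_new
    r_new += 0.5 * eps * grad_log_p(theta_new)
    # Metropolis accept/reject on the joint (theta, r) "energy".
    log_accept = (log_p(theta_new) - 0.5 * r_new**2) - (log_p(theta) - 0.5 * r**2)
    return theta_new if np.log(rng.uniform()) < log_accept else theta

# Example: standard normal target, started far from the mode.
log_p = lambda th: -0.5 * th**2
grad_log_p = lambda th: -th
theta = 3.0
for _ in range(1000):
    theta = one_hmc_step(theta, log_p, grad_log_p)
```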

Example: Cauchy model

$$y_i \given \theta \mathop{\sim}^{\rr{ind}} { 1 \over \pi\left[ (y_i - \theta)^2 + 1 \right] }$$

Observe data \\( y_1 = 4 \\), \\( y_2 = -4 \\). With a flat prior on \\( \theta \\), the posterior is:

$$p(\theta \given Y) \propto { 1 \over (4 - \theta)^2 + 1 } \cdot { 1 \over (4 + \theta)^2 + 1 }$$

  • The posterior is bimodal, with one well-separated mode near each observation.

HMC: how it works on Cauchy model
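One way to set this up (assuming a flat prior and a unit-mass momentum variable \\( r \\), as in the generic sketch earlier; not necessarily the lecture's exact parameterization): the potential energy is the negative log posterior, and its gradient drives the leapfrog updates.

$$\begin{align} U(\theta) &= -\log p(\theta \given Y) = \log\left[(4-\theta)^2 + 1\right] + \log\left[(4+\theta)^2 + 1\right] + \rr{const} \\ U'(\theta) &= -{2(4-\theta) \over (4-\theta)^2 + 1} + {2(4+\theta) \over (4+\theta)^2 + 1} \\ H(\theta, r) &= U(\theta) + {r^2 \over 2} \end{align}$$

The leapfrog steps alternate half-updates of \\( r \\) with full updates of \\( \theta \\) using \\( -U'(\theta) \\); whether a trajectory crosses from one mode to the other depends on how much kinetic energy the drawn momentum supplies relative to the barrier between them.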

Summary of HMC

  • Simulates Hamiltonian dynamics.
  • Moves around the space faster.
  • "Strictly better" than classic Metropolis for continuous rvs (converges faster, smaller autocorrelation).
  • Like classic Metropolis, it can still get trapped in well-separated local modes.

Questions inspired by HMC

  • Is there an equivalent for discrete rvs?
  • How do we deal with separated local modes?

Announcements

  • T-shirt comp!
  • Use Piazza, share ideas about homework.
  • Pset4 due next Wednesday.
  • Final project: method design meetings, end-of-semester presentation.

Resources

Final slide

  • Next lecture: HMC nuts and bolts