EM variants, data augmentation

Stat 221, Lecture 12

@snesterko

Roadmap

  • Finish EM for HMM.
  • Solve the mystery of the Q function.
  • Look at other versions of EM:
    • Tweak the M step (ECM).
    • Use EM via data augmentation.
    • Tweak the E step (PX-EM and mcEM).

The mystery of the \\( Q \\) function
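
  • As a reminder (standard EM notation, not spelled out on this slide), the \\( Q \\) function is the expected complete-data log-likelihood:

$$Q(\theta \given \theta^{(t)}) = E\left( l(\theta \given Y) \given Y_{\rr{obs}}, \theta^{(t)}\right)$$

  • The mystery: why does maximizing \\( Q \\) over \\( \theta \\) also increase the observed-data likelihood \\( l(\theta \given Y_{\rr{obs}}) \\)?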

The ECM algorithm

  • Expectation-Conditional Maximization.
  • Fix some parameters, maximize over others, and iterate.
  • Little and Rubin pp. 179-183, 223-226.

Example

  • Multivariate regression with incomplete data

$$Y_i \sim N_K \left( X_i \beta , \Sigma\right), i = 1, \ldots, n$$

CM step

$$\begin{align}\beta^{(t + 1)} = & \left( \sum_{i=1}^n X_i^T (\Sigma^{(t)})^{-1}X_i\right)^{-1} \\ & \cdot \left( \sum_{i=1}^n X_i^T (\Sigma^{(t)})^{-1}Y_i\right)\end{align}$$

$$\Sigma^{(t+1)} = \frac{1}{n} \sum_{i=1}^n (Y_i - X_i \beta^{(t+1)}) (Y_i - X_i \beta^{(t+1)})^T$$
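
  • A minimal numpy sketch of one CM cycle under this model (names and array shapes are illustrative, assuming the data \\( Y \\) are complete or already filled in by an E step):

```python
import numpy as np

def cm_step(X, Y, Sigma):
    """One ECM cycle: update beta given Sigma, then Sigma given the new beta.

    X     : (n, K, p) array of design matrices X_i
    Y     : (n, K) array of responses (complete, or filled in by the E step)
    Sigma : (K, K) current covariance estimate
    """
    n = X.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)

    # CM step 1: beta^(t+1) = (sum_i X_i' Sigma^-1 X_i)^-1 (sum_i X_i' Sigma^-1 Y_i)
    A = sum(X[i].T @ Sigma_inv @ X[i] for i in range(n))
    b = sum(X[i].T @ Sigma_inv @ Y[i] for i in range(n))
    beta = np.linalg.solve(A, b)

    # CM step 2: Sigma^(t+1) = (1/n) sum_i (Y_i - X_i beta)(Y_i - X_i beta)'
    resid = Y - np.einsum('ikp,p->ik', X, beta)
    return beta, resid.T @ resid / n
```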

Complete-data likelihood increases

$$\begin{align} l( \beta^{(t+1)}, \Sigma^{(t)} \given Y) & \geq l( \beta^{(t)}, \Sigma^{(t)} \given Y) \\ l( \beta^{(t+1)}, \Sigma^{(t+1)} \given Y) & \geq l( \beta^{(t+1)}, \Sigma^{(t)} \given Y)\end{align}$$

E step

$$\begin{align}&l(\mu, \Sigma \given Y_{\rr{obs}}) = \rr{const} - \frac{1}{2} \sum_{i=1}^n \log |\Sigma_{\rr{obs}, i}| \\ & - \frac{1}{2} \sum_{i=1}^n (y_{\rr{obs}, i} -\mu_{\rr{obs}, i})^T \Sigma^{-1}_{\rr{obs}, i} (y_{\rr{obs}, i} - \mu_{\rr{obs}, i})\end{align}$$

  • Why do \\( \mu_{\rr{obs}, i}, \Sigma_{\rr{obs}, i}\\) have an \\( i \\)?

Complete-data sufficient statistics

$$\begin{align}S = &\left( \sum_{i=1}^n y_{ij}, \rr{ } j = 1, \ldots, K; \right. \\ & \left. \sum_{i=1}^n y_{ij} y_{ik}, \rr{ } j, k = 1, \ldots, K\right)\end{align}$$

More on E step

  • Let \\( \theta^{(t)} = \left( \mu^{(t)}, \Sigma^{(t)}\right) \\)

$$E \left( \sum_{i=1}^n y_{ij} \given Y_{\rr{obs}}, \theta^{(t)}\right) = \sum_{i=1}^n y_{ij}^{(t)}, \rr{ } j=1, \ldots, K$$

$$\begin{align}E \left( \sum_{i=1}^n y_{ij}y_{ik} \given Y_{\rr{obs}}, \theta^{(t)}\right) = & \sum_{i=1}^n (y_{ij}^{(t)}y_{ik}^{(t)} + c_{jki}^{(t)}), \\ &j,k=1, \ldots, K\end{align}$$

Definitions

$$y_{ij}^{(t)} = \begin{cases} y_{ij} & y_{ij} \rr{ is observed} \\ E(y_{ij} \given y_{\rr{obs}, i}, \theta^{(t)}) & y_{ij} \rr{ is missing}\end{cases}$$

$$c_{jki}^{(t)} = \begin{cases} 0 & y_{ij} \rr{ or } y_{ik} \rr{ obs} \\ \rr{Cov}(y_{ij}, y_{ik} \given y_{\rr{obs}, i}, \theta^{(t)}) & \rr{otherwise}\end{cases}$$

How do we calculate the conditional expectations?

  • Multivariate Normal, condition on observed data.
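
  • A sketch of that conditioning step for one data row, assuming a boolean mask `miss` marks its missing entries; the outputs are exactly the \\( y_{ij}^{(t)} \\) and \\( c_{jki}^{(t)} \\) defined above:

```python
import numpy as np

def conditional_moments(y, miss, mu, Sigma):
    """E(y_mis | y_obs) and Cov(y_mis | y_obs) for one multivariate normal row.

    y     : (K,) data vector (missing entries may hold anything, e.g. np.nan)
    miss  : (K,) boolean mask, True where the entry is missing
    mu    : (K,) current mean estimate
    Sigma : (K, K) current covariance estimate
    """
    obs = ~miss
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(miss, obs)]
    S_mm = Sigma[np.ix_(miss, miss)]

    # Regression of the missing block on the observed block
    B = S_mo @ np.linalg.inv(S_oo)
    cond_mean = mu[miss] + B @ (y[obs] - mu[obs])   # fills in the y_ij^(t)
    cond_cov = S_mm - B @ S_mo.T                    # supplies the c_jki^(t)
    return cond_mean, cond_cov
```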

Step back

  • ECM and Gibbs connection.
  • Divide and conquer.

EM and data augmentation

  • Data augmentation is hacker-level modeling.
  • Using EM when there is no missing data.
  • Expanding parameter space - PX-EM.

Example: \\( t \\) distribution

$$f(x_i \given \theta) = \frac{\Gamma(\nu / 2 + 1/2)}{\sqrt{\pi\nu \sigma^2} \Gamma(\nu / 2) \left[1 + \frac{(x_i - \mu)^2}{\nu \sigma^2}\right]^{(\nu + 1) / 2}}$$

  • Not an exponential family.
  • Parameters \\( \theta = (\mu, \sigma) \\).
  • Define augmented complete dataset \\( Y = ( Y_{\rr{obs}}, Y_{\rr{mis}} ) \\) with \\( Y_{\rr{obs}} = X \\) and \\( Y_{\rr{mis}} = W = (w_1, \ldots, w_n) \\).

Augmented data

$$x_i \given \theta, w_i \mathop{\sim}^{\rr{ind}} N( \mu, \sigma^2/w_i), \rr{ } w_i \given \theta \sim \chi^2_\nu / \nu$$

  • Complete data is an exponential family with sufficient statistics

$$s_0 = \sum_{i=1}^n w_i, \rr{ } s_1 = \sum_{i=1}^n w_i x_i, \rr{ } s_2 = \sum_{i=1}^n w_i x_i^2$$

  • Analytic E and M steps.
  • Little and Rubin pp. 175-176.

E step

$$w_i^{(t)} = E(w_i \given x_i, \theta^{(t)}) = \frac{\nu + 1}{\nu + d_i^{(t)^2}}$$

  • where \\( d_i^{(t)} = (x_i - \mu^{(t)}) / \sigma^{(t)} \\).

M step

$$\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}$$
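
  • Putting the E and M steps together, a minimal sketch for fixed \\( \nu \\) (function and variable names are mine, not Little and Rubin's):

```python
import numpy as np

def em_t(x, nu, n_iter=100):
    """EM for the location/scale t model via the normal scale-mixture augmentation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sigma2 = x.mean(), x.var()            # starting values

    for _ in range(n_iter):
        # E step: w_i^(t) = (nu + 1) / (nu + d_i^2)
        d2 = (x - mu) ** 2 / sigma2
        w = (nu + 1) / (nu + d2)

        # M step through the weighted sufficient statistics s0, s1, s2
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x ** 2).sum()
        mu = s1 / s0
        sigma2 = (s2 - s1 ** 2 / s0) / n

    return mu, sigma2
```

  • For instance, `em_t(np.random.standard_t(5, size=500), nu=5)` should land near \\( \mu = 0, \sigma^2 = 1 \\) for a large sample.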

Achilles' heel of EM

  • What is it?

PX-EM

  • Augment parameter space.
  • Define new parameters \\( \phi = (\theta^*, \alpha) \\)
  • Original model is obtained by setting \\( \alpha = \alpha_0 \\).
  • When \\( \alpha = \alpha_0 \\), \\( \theta^* = \theta \\).
  • Dimension of \\( \theta^* \\) is the same as the dimension of \\( \theta \\).
  • \\( \theta = R( \theta^*, \alpha) \\) for known transformation \\( R \\).
  • Convergence speeds up.
  • Little and Rubin pp. 184-186.
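
  • A sketch of the textbook illustration (the \\( t \\) model above, per the cited pages): let the scale of \\( w_i \\) be an extra parameter, \\( w_i \given \phi \sim \alpha \chi^2_\nu / \nu \\) with \\( \alpha_0 = 1 \\), so \\( \theta = R(\theta^*, \alpha) = (\mu^*, \sigma^{*2} / \alpha) \\). Mapping back after the expanded M step amounts to dividing by \\( s_0^{(t)} \\) instead of \\( n \\):

$$\sigma^{(t+1)^2} = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{s_0^{(t)}}$$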

EM as a half-Bayesian method

  • What part of EM looks Bayesian?

mcEM

  • Change the E step
  • Replace \\( E (Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta) \\) with a Monte Carlo estimate (sketched after this list).
  • Obtain the estimate by simulating from \\( f(Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta) \\).
  • What are the possible caveats?
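
  • For the \\( t \\) example, the conditional distribution to simulate from is \\( w_i \given x_i, \theta^{(t)} \sim \rr{Gamma}\left( \frac{\nu + 1}{2}, \rr{rate} = \frac{\nu + d_i^{(t)^2}}{2} \right) \\), whose mean is exactly the analytic E step above. A minimal Monte Carlo E step (names are illustrative):

```python
import numpy as np

def mc_e_step(x, mu, sigma2, nu, m=200, rng=None):
    """Monte Carlo E step for the t model: estimate E(w_i | x_i, theta) by simulation.

    Draws m copies of each w_i from Gamma((nu + 1)/2, rate = (nu + d_i^2)/2)
    and averages; the analytic E step is recovered as m -> infinity.
    """
    rng = np.random.default_rng() if rng is None else rng
    d2 = (x - mu) ** 2 / sigma2
    shape = (nu + 1) / 2
    scale = 2 / (nu + d2)          # numpy parameterizes Gamma by scale = 1 / rate
    w_draws = rng.gamma(shape, scale, size=(m, len(x)))
    return w_draws.mean(axis=0)    # Monte Carlo version of w_i^(t)
```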

Announcements

  • T-shirt winner announcement Wednesday - submit your entry!

Resources

Final slide

  • Next lecture: Likelihood + Prior = Posterior, MCMC