EM variants, data augmentation

Stat 221, Lecture 12

@snesterko

Roadmap

  • Finish EM for HMM.
  • Solve the mystery of the Q function.
  • Look at other versions of EM:
    • Tweak the M step (ECM).
    • Use EM via data augmentation.
    • Tweak the E step (PX-EM and mcEM).

The mystery of the \\( Q \\) function
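
  • As a reminder (standard EM notation, not spelled out on this slide), the \\( Q \\) function is the expected complete-data log-likelihood:

$$Q(\theta \given \theta^{(t)}) = E\left( l(\theta \given Y) \given Y_{\rr{obs}}, \theta^{(t)}\right)$$

  • The mystery: why does maximizing \\( Q \\) over \\( \theta \\) also increase the observed-data likelihood \\( l(\theta \given Y_{\rr{obs}}) \\)?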

The ECM algorithm

  • Expectation-Conditional Maximization.
  • Fix some parameters, maximize over others, and iterate.
  • Little and Rubin pp. 179-183, 223-226.

Example

  • Multivariate regression with incomplete data

$$Y_i \sim N_K \left( X_i \beta , \Sigma\right), i = 1, \ldots, n$$

CM step

$$\begin{align}\beta^{(t + 1)} = & \left( \sum_{i=1}^n X_i^T (\Sigma^{(t)})^{-1}X_i\right)^{-1} \\ & \cdot \left( \sum_{i=1}^n X_i^T (\Sigma^{(t)})^{-1}Y_i\right)\end{align}$$

$$\Sigma^{(t+1)} = \frac{1}{n} \sum_{i=1}^n (Y_i - X_i \beta^{(t+1)}) (Y_i - X_i \beta^{(t+1)})^T$$
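
  • A minimal numpy sketch of one CM cycle under this model (names and array shapes are illustrative, assuming the data \\( Y \\) are complete or already filled in by an E step):

```python
import numpy as np

def cm_step(X, Y, Sigma):
    """One ECM cycle: update beta given Sigma, then Sigma given the new beta.

    X     : (n, K, p) array of design matrices X_i
    Y     : (n, K) array of responses (complete, or filled in by the E step)
    Sigma : (K, K) current covariance estimate
    """
    n = X.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)

    # CM step 1: beta^(t+1) = (sum_i X_i' Sigma^-1 X_i)^-1 (sum_i X_i' Sigma^-1 Y_i)
    A = sum(X[i].T @ Sigma_inv @ X[i] for i in range(n))
    b = sum(X[i].T @ Sigma_inv @ Y[i] for i in range(n))
    beta = np.linalg.solve(A, b)

    # CM step 2: Sigma^(t+1) = (1/n) sum_i (Y_i - X_i beta)(Y_i - X_i beta)'
    resid = Y - np.einsum('ikp,p->ik', X, beta)
    return beta, resid.T @ resid / n
```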

Complete-data likelihood increases

$$\begin{align} l( \beta^{(t+1)}, \Sigma^{(t)} \given Y) & \geq l( \beta^{(t)}, \Sigma^{(t)} \given Y) \\ l( \beta^{(t+1)}, \Sigma^{(t+1)} \given Y) & \geq l( \beta^{(t+1)}, \Sigma^{(t)} \given Y)\end{align}$$

E step

$$\begin{align}&l(\mu, \Sigma \given Y_{\rr{obs}}) = \rr{const} - \frac{1}{2} \sum_{i=1}^n \log |\Sigma_{\rr{obs}, i}| \\ & - \frac{1}{2} \sum_{i=1}^n (y_{\rr{obs}, i} -\mu_{\rr{obs}, i})^T \Sigma^{-1}_{\rr{obs}, i} (y_{\rr{obs}, i} - \mu_{\rr{obs}, i})\end{align}$$

  • Why do \\( \mu_{\rr{obs}, i}, \Sigma_{\rr{obs}, i}\\) have an \\( i \\)?

Complete-data sufficient statistics

$$\begin{align}S = &\left( \sum_{i=1}^n y_{ij}, \rr{ } j = 1, \ldots, K; \right. \\ & \left. \sum_{i=1}^n y_{ij} y_{ik}, \rr{ } j, k = 1, \ldots, K\right)\end{align}$$

More on E step

  • Let \\( \theta^{(t)} = \left( \mu^{(t)}, \Sigma^{(t)}\right) \\)

$$E \left( \sum_{i=1}^n y_{ij} \given Y_{\rr{obs}}, \theta^{(t)}\right) = \sum_{i=1}^n y_{ij}^{(t)}, \rr{ } j=1, \ldots, K$$

$$\begin{align}E \left( \sum_{i=1}^n y_{ij}y_{ik} \given Y_{\rr{obs}}, \theta^{(t)}\right) = & \sum_{i=1}^n (y_{ij}^{(t)}y_{ik}^{(t)} + c_{jki}^{(t)}), \\ &j,k=1, \ldots, K\end{align}$$

Definitions

$$y_{ij}^{(t)} = \begin{cases} y_{ij} & y_{ij} \rr{ is observed} \\ E(y_{ij} \given y_{\rr{obs}, i}, \theta^{(t)}) & y_{ij} \rr{ is missing}\end{cases}$$

$$c_{jki}^{(t)} = \begin{cases} 0 & y_{ij} \rr{ or } y_{ik} \rr{ obs} \\ \rr{Cov}(y_{ij}, y_{ik} \given y_{\rr{obs}, i}, \theta^{(t)}) & \rr{otherwise}\end{cases}$$

How do we calculate the conditional expectations?

  • Multivariate Normal, condition on observed data.
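
  • A sketch of that conditioning step for one data row, assuming a boolean mask `miss` marks its missing entries; the outputs are exactly the \\( y_{ij}^{(t)} \\) and \\( c_{jki}^{(t)} \\) defined above:

```python
import numpy as np

def conditional_moments(y, miss, mu, Sigma):
    """E(y_mis | y_obs) and Cov(y_mis | y_obs) for one multivariate normal row.

    y     : (K,) data vector (missing entries may hold anything, e.g. np.nan)
    miss  : (K,) boolean mask, True where the entry is missing
    mu    : (K,) current mean estimate
    Sigma : (K, K) current covariance estimate
    """
    obs = ~miss
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(miss, obs)]
    S_mm = Sigma[np.ix_(miss, miss)]

    # Regression of the missing block on the observed block
    B = S_mo @ np.linalg.inv(S_oo)
    cond_mean = mu[miss] + B @ (y[obs] - mu[obs])   # fills in the y_ij^(t)
    cond_cov = S_mm - B @ S_mo.T                    # supplies the c_jki^(t)
    return cond_mean, cond_cov
```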

Step back

  • ECM and Gibbs connection.
  • Divide and conquer.

EM and data augmentation

  • Data augmentation is hacker-level modeling.
  • Using EM when there is no missing data.
  • Expanding parameter space - PX-EM.

Example: \\( t \\) distribution

$$f(x_i \given \theta) = \frac{\Gamma(\nu / 2 + 1/2)}{\sqrt{\pi\nu \sigma^2} \Gamma(\nu / 2) \left[1 + \frac{(x_i - \mu)^2}{\nu \sigma^2}\right]^{(\nu + 1) / 2}}$$

  • Not an exponential family.
  • Parameters \\( \theta = (\mu, \sigma) \\).
  • Define augmented complete dataset \\( Y = ( Y_{\rr{obs}}, Y_{\rr{mis}} ) \\) with \\( Y_{\rr{obs}} = X \\) and \\( Y_{\rr{mis}} = W = (w_1, \ldots, w_n) \\).

Augmented data

$$x_i \given \theta, w_i \mathop{\sim}^{\rr{ind}} N( \mu, \sigma^2/w_i), \rr{ } w_i \given \theta \sim \chi^2_\nu / \nu$$

  • Complete data is an exponential family with sufficient statistics

$$s_0 = \sum_{i=1}^n w_i, \rr{ } s_1 = \sum_{i=1}^n w_i x_i, \rr{ } s_2 = \sum_{i=1}^n w_i x_i^2$$

  • Analytic E and M steps.
  • Little and Rubin pp. 175-176.

E step

$$w_i^{(t)} = E(w_i \given x_i, \theta^{(t)}) = \frac{\nu + 1}{\nu + d_i^{(t)^2}}$$

  • where \\( d_i^{(t)} = (x_i - \mu^{(t)}) / \sigma^{(t)} \\).

M step

$$\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}$$
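
  • Putting the E and M steps together, a minimal sketch for fixed \\( \nu \\) (function and variable names are mine, not Little and Rubin's):

```python
import numpy as np

def em_t(x, nu, n_iter=100):
    """EM for the location/scale t model via the normal scale-mixture augmentation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sigma2 = x.mean(), x.var()            # starting values

    for _ in range(n_iter):
        # E step: w_i^(t) = (nu + 1) / (nu + d_i^2)
        d2 = (x - mu) ** 2 / sigma2
        w = (nu + 1) / (nu + d2)

        # M step through the weighted sufficient statistics s0, s1, s2
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x ** 2).sum()
        mu = s1 / s0
        sigma2 = (s2 - s1 ** 2 / s0) / n

    return mu, sigma2
```

  • For instance, `em_t(np.random.standard_t(5, size=500), nu=5)` should land near \\( \mu = 0, \sigma^2 = 1 \\) for a large sample.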

Achilles' heel of EM

  • What is it?

PX-EM

  • Augment parameter space.
  • Define new parameters \\( \phi = (\theta^*, \alpha) \\)
  • Original model is obtained by setting \\( \alpha = \alpha_0 \\).
  • When \\( \alpha = \alpha_0 \\), \\( \theta^* = \theta \\).
  • Dimension of \\( \theta^* \\) is the same as the dimension of \\( \theta \\).
  • \\( \theta = R( \theta^*, \alpha) \\) for known transformation \\( R \\).
  • Convergence speeds up.
  • Little and Rubin pp. 184-186.
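
  • A sketch of the textbook illustration (the \\( t \\) model above, per the cited pages): let the scale of \\( w_i \\) be an extra parameter, \\( w_i \given \phi \sim \alpha \chi^2_\nu / \nu \\) with \\( \alpha_0 = 1 \\), so \\( \theta = R(\theta^*, \alpha) = (\mu^*, \sigma^{*2} / \alpha) \\). Mapping back after the expanded M step amounts to dividing by \\( s_0^{(t)} \\) instead of \\( n \\):

$$\sigma^{(t+1)^2} = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{s_0^{(t)}}$$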

EM as a half-Bayesian method

  • What part of EM looks Bayesian?

mcEM

  • Change the E step
  • Replace \\( E (Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta) \\) with a Monte Carlo estimate (sketched after this list).
  • Obtain the estimate by simulating from \\( f(Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta) \\).
  • What are the possible caveats?
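
  • For the \\( t \\) example, the conditional distribution to simulate from is \\( w_i \given x_i, \theta^{(t)} \sim \rr{Gamma}\left( \frac{\nu + 1}{2}, \rr{rate} = \frac{\nu + d_i^{(t)^2}}{2} \right) \\), whose mean is exactly the analytic E step above. A minimal Monte Carlo E step (names are illustrative):

```python
import numpy as np

def mc_e_step(x, mu, sigma2, nu, m=200, rng=None):
    """Monte Carlo E step for the t model: estimate E(w_i | x_i, theta) by simulation.

    Draws m copies of each w_i from Gamma((nu + 1)/2, rate = (nu + d_i^2)/2)
    and averages; the analytic E step is recovered as m -> infinity.
    """
    rng = np.random.default_rng() if rng is None else rng
    d2 = (x - mu) ** 2 / sigma2
    shape = (nu + 1) / 2
    scale = 2 / (nu + d2)          # numpy parameterizes Gamma by scale = 1 / rate
    w_draws = rng.gamma(shape, scale, size=(m, len(x)))
    return w_draws.mean(axis=0)    # Monte Carlo version of w_i^(t)
```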

Announcements

  • T-shirt winner announcement Wednesday - submit your entry!

Resources

Final slide

  • Next lecture: Likelihood + Prior = Posterior, MCMC