# EM variants, data augmentation

### Stat 221, Lecture 12

@snesterko

• Finish EM for HMM.
• Solve the mystery of the Q function.
• Look at other versions of EM:
  • Tweak the M step (ECM).
  • Use EM via data augmentation.
  • Tweak the E step (PX-EM and mcEM).

## The ECM algorithm

• Expectation-Conditional Maximization.
• Fix some parameters, maximize over others, and iterate.
• Little and Rubin pp. 179-183, 223-226.

## Example

• Multivariate regression with incomplete data

$$Y_i \sim N_K \left( X_i \beta , \Sigma\right), i = 1, \ldots, n$$

## CM step

\begin{align}\beta^{(t + 1)} = & \left( \sum_{i=1}^n X_i^T (\Sigma^{(t)})^{-1}X_i\right)^{-1} \\ & \cdot \left( \sum_{i=1}^n X_i^T (\Sigma^{(t)})^{-1}Y_i\right)\end{align}

$$\Sigma^{(t+1)} = \frac{1}{n} \sum_{i=1}^n (Y_i - X_i \beta^{(t+1)}) (Y_i - X_i \beta^{(t+1)})^T$$
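The two conditional maximizations above can be sketched directly in NumPy. This is a minimal sketch for the complete-data case (the function name `ecm_step` and the array layout are my own choices); with missing entries in the responses, an E step would first fill in the expected sufficient statistics.

```python
import numpy as np

def ecm_step(X, Y, Sigma):
    """One ECM iteration for Y_i ~ N_K(X_i beta, Sigma).

    X: (n, K, p) array of design matrices X_i; Y: (n, K) responses.
    Sketch assumes complete data.
    """
    n = Y.shape[0]
    Sinv = np.linalg.inv(Sigma)
    # CM step 1: GLS update of beta with Sigma fixed at Sigma^(t)
    A = sum(X[i].T @ Sinv @ X[i] for i in range(n))
    b = sum(X[i].T @ Sinv @ Y[i] for i in range(n))
    beta = np.linalg.solve(A, b)
    # CM step 2: update Sigma with beta fixed at beta^(t+1)
    R = Y - np.einsum('ikp,p->ik', X, beta)  # residuals Y_i - X_i beta
    Sigma_new = R.T @ R / n
    return beta, Sigma_new
```

Iterating `ecm_step` alternates the two CM steps, each holding the other block of parameters fixed.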

## Complete-data likelihood increases

\begin{align} l( \beta^{(t+1)}, \Sigma^{(t)} \given Y) & \geq l( \beta^{(t)}, \Sigma^{(t)} \given Y) \\ l( \beta^{(t+1)}, \Sigma^{(t+1)} \given Y) & \geq l( \beta^{(t+1)}, \Sigma^{(t)} \given Y)\end{align}

## E step

\begin{align}&l(\mu, \Sigma \given Y_{\rr{obs}}) = \rr{const} - \frac{1}{2} \sum_{i=1}^n \log |\Sigma_{\rr{obs}, i}| \\ & - \frac{1}{2} \sum_{i=1}^n (y_{\rr{obs}, i} -\mu_{\rr{obs}, i})^T \Sigma^{-1}_{\rr{obs}, i} (y_{\rr{obs}, i} - \mu_{\rr{obs}, i})\end{align}

• Why do \$$\mu_{\rr{obs}, i}, \Sigma_{\rr{obs}, i}\$$ have an \$$i\$$?

## Complete-data sufficient statistics

\begin{align}S = &\left( \sum_{i=1}^n y_{ij}, \rr{ } j = 1, \ldots, K; \right. \\ & \left. \sum_{i=1}^n y_{ij} y_{ik}, \rr{ } j, k = 1, \ldots, K\right)\end{align}

## More on E step

• Let \$$\theta^{(t)} = \left( \mu^{(t)}, \Sigma^{(t)}\right) \$$

$$E \left( \sum_{i=1}^n y_{ij} \given Y_{\rr{obs}}, \theta^{(t)}\right) = \sum_{i=1}^n y_{ij}^{(t)}, \rr{ } j=1, \ldots, K$$

\begin{align}E \left( \sum_{i=1}^n y_{ij}y_{ik} \given Y_{\rr{obs}}, \theta^{(t)}\right) = & \sum_{i=1}^n (y_{ij}^{(t)}y_{ik}^{(t)} + c_{jki}^{(t)}), \\ &j,k=1, \ldots, K\end{align}

## Definitions

$$y_{ij}^{(t)} = \begin{cases} y_{ij} & y_{ij} \rr{ is observed} \\ E(y_{ij} \given y_{\rr{obs}, i}, \theta^{(t)}) & y_{ij} \rr{ is missing}\end{cases}$$

$$c_{jki}^{(t)} = \begin{cases} 0 & y_{ij} \rr{ or } y_{ik} \rr{ obs} \\ \rr{Cov}(y_{ij}, y_{ik} \given y_{\rr{obs}, i}, \theta^{(t)}) & \rr{otherwise}\end{cases}$$

## How do we calculate the conditional expectations?

• Multivariate Normal, condition on observed data.
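Conditioning a multivariate normal on its observed coordinates is a standard sweep-type computation; the conditional means give the \$$y_{ij}^{(t)}\$$ fill-ins and the conditional covariances give the \$$c_{jki}^{(t)}\$$ terms. A minimal sketch (the function name `sweep_missing` is mine):

```python
import numpy as np

def sweep_missing(y, mu, Sigma):
    """Conditional mean/covariance of missing entries of y given the
    observed ones, under y ~ N_K(mu, Sigma).

    y: length-K vector with np.nan marking missing entries.
    Returns (y_filled, C): y with missing entries replaced by their
    conditional means, and a K x K matrix whose block over missing
    indices is Cov(y_j, y_k | y_obs), zero elsewhere.
    """
    miss = np.isnan(y)
    obs = ~miss
    y_filled = y.copy()
    C = np.zeros((len(y), len(y)))
    if miss.any():
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(miss, obs)]
        B = S_mo @ np.linalg.inv(S_oo)  # regression of missing on observed
        y_filled[miss] = mu[miss] + B @ (y[obs] - mu[obs])
        C[np.ix_(miss, miss)] = Sigma[np.ix_(miss, miss)] - B @ S_mo.T
    return y_filled, C
```

Summing `y_filled` and `y_filled` outer products plus `C` over units reproduces the expected complete-data sufficient statistics from the previous slides.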

## Step back

• ECM and Gibbs connection.
• Divide and conquer.

## EM and data augmentation

• Data augmentation is hacker-level modeling.
• Using EM when there is no missing data.
• Expanding parameter space - PX-EM.

## Example: \$$t \$$ distribution

$$f(x_i \given \theta) = \frac{\Gamma(\nu / 2 + 1/2)}{\sqrt{\pi\nu \sigma^2} \Gamma(\nu / 2) \left[1 + \frac{(x_i - \mu)^2}{\nu \sigma^2}\right]^{(\nu + 1) / 2}}$$

• Not an exponential family.
• Parameters \$$\theta = (\mu, \sigma) \$$, with \$$\nu \$$ fixed and known.
• Define augmented complete dataset \$$Y = ( Y_{\rr{obs}}, Y_{\rr{mis}} ) \$$ with \$$Y_{\rr{obs}} = X \$$ and \$$Y_{\rr{mis}} = W = (w_1, \ldots, w_n) \$$

## Augmented data

$$x_i \given \theta, w_i \mathop{\sim}^{\rr{ind}} N( \mu, \sigma^2/w_i), \rr{ } w_i \given \theta \sim \chi^2_\nu / \nu$$

• Complete data is an exponential family with sufficient statistics

$$s_0 = \sum_{i=1}^n w_i, \rr{ } s_1 = \sum_{i=1}^n w_i x_i, \rr{ } s_2 = \sum_{i=1}^n w_i x_i^2$$

• Analytic E and M steps.
• Little and Rubin pp. 175-176.

## E step

$$w_i^{(t)} = E(w_i \given x_i, \theta^{(t)}) = \frac{\nu + 1}{\nu + d_i^{(t)^2}}$$

• where \$$d_i^{(t)} = (x_i - \mu^{(t)}) / \sigma^{(t)} \$$.

## M step

\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}

• What is it?
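The E and M steps combine into a few lines of NumPy. A minimal sketch (the function name `em_t` is mine; \$$\nu\$$ is held fixed, and the weighted sufficient statistics \$$s_0, s_1, s_2\$$ are recomputed from the current weights each iteration):

```python
import numpy as np

def em_t(x, nu, n_iter=100):
    """EM for the location-scale t model via the normal/chi-square
    augmentation; nu is held fixed."""
    mu, sigma2 = x.mean(), x.var()
    for _ in range(n_iter):
        # E step: w_i^(t) = (nu + 1) / (nu + d_i^2)
        d2 = (x - mu) ** 2 / sigma2
        w = (nu + 1) / (nu + d2)
        # M step via the weighted sufficient statistics s0, s1, s2
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x ** 2).sum()
        mu = s1 / s0
        sigma2 = (s2 - s1 ** 2 / s0) / len(x)
    return mu, sigma2
```

Each update is a weighted mean and a weighted sum of squares, with outlying points automatically downweighted through \$$w_i^{(t)}\$$.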

## PX-EM

• Augment parameter space.
• Define new parameters \$$\phi = (\theta^*, \alpha) \$$
• Original model is obtained by setting \$$\alpha = \alpha_0 \$$.
• When \$$\alpha = \alpha_0 \$$, \$$\theta^* = \theta \$$.
• Dimension of \$$\theta^* \$$ is the same as the dimension of \$$\theta \$$.
• \$$\theta = R( \theta^*, \alpha) \$$ for known transformation \$$R \$$.
• Convergence speeds up.
• Little and Rubin pp. 184-186.
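For the \$$t\$$ example, the classic expansion (Liu, Rubin and Wu) frees the scale of the latent weights, \$$\alpha = E(w_i)\$$, instead of fixing it at 1; the reduction \$$R\$$ then rescales \$$\sigma^2\$$ by \$$\hat\alpha = s_0 / n\$$. The net effect is that the \$$\sigma^2\$$ update divides by \$$s_0\$$ rather than \$$n\$$. A sketch under these assumptions (the function name `px_em_t` is mine):

```python
import numpy as np

def px_em_t(x, nu, n_iter=100):
    """PX-EM for the t model: same E step as EM, but the reduced
    M step divides by s0 = sum(w) rather than n. nu held fixed."""
    mu, sigma2 = x.mean(), x.var()
    for _ in range(n_iter):
        w = (nu + 1) / (nu + (x - mu) ** 2 / sigma2)  # E step unchanged
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x ** 2).sum()
        mu = s1 / s0
        sigma2 = (s2 - s1 ** 2 / s0) / s0  # PX-EM: divide by s0, not n
    return mu, sigma2
```

Both versions share the same fixed point (at the MLE the score equation forces \$$\sum_i \hat w_i = n\$$), but the expanded update typically converges in fewer iterations.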

## EM as a half-Bayesian method

• What part of EM looks Bayesian?

## mcEM

• Change the E step
• Replace the expectation over \$$Y_{\rr{mis}} \$$ in the Q function by a Monte Carlo estimate.
• Obtain the estimate by simulating draws from \$$f(Y_{\rr{mis}} \given Y_{\rr{obs}}, \theta)\$$.
• What are the possible caveats?
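In the \$$t\$$ example the conditional distribution of each latent weight is available in closed form, \$$w_i \given x_i, \theta \sim \rr{Gamma}\left(\frac{\nu+1}{2}, \rr{rate} = \frac{\nu + d_i^2}{2}\right)\$$, so an exact E step exists and a Monte Carlo version is purely pedagogical. A sketch (the function name `mc_e_step` is mine):

```python
import numpy as np

def mc_e_step(x, mu, sigma2, nu, m=200, rng=None):
    """Monte Carlo E step for the t example: instead of the exact
    weight (nu + 1) / (nu + d^2), draw m copies of each latent w_i
    from its conditional Gamma distribution and average."""
    rng = np.random.default_rng(rng)
    d2 = (x - mu) ** 2 / sigma2
    shape = (nu + 1) / 2
    rate = (nu + d2) / 2
    # draws have shape (m, n); numpy's gamma takes scale = 1 / rate
    w_draws = rng.gamma(shape, 1.0 / rate, size=(m, x.size))
    return w_draws.mean(axis=0)  # estimate of E(w_i | x_i, theta)
```

This makes the caveats concrete: the Monte Carlo error never fully vanishes, the monotone-likelihood guarantee of exact EM is lost, and \$$m\$$ typically has to grow as the iterates approach convergence.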

## Announcements

• T-shirt winner announcement Wednesday - submit your entry!

## Final slide

• Next lecture: Likelihood + Prior = Posterior, MCMC