EM for HMMs, properties of EM

Stat 221, Lecture 11

@snesterko

Emphasis of this lecture

  • EM for HMMs - forward-backward algorithm.
  • Thinking about R-friendly ways to perform computation.
  • Understanding the benefits and limitations of EM in non-trivial settings.

Roadmap

  1. Derive EM for HMMs.
  2. Take a break to critique a visualization of EM (homework).
  3. Keep deriving EM for HMMs and thinking about the relationship with computing and inference:
    • How to do HMM EM computation faster in R?
    • Does the HMM EM ever converge?

EM for HMMs

  • \\(\{h_1, h_2, \ldots \} \\) live on a discrete space and follow a Markov process with transition matrix \\( A \\).
  • At time \\( t\\), state \\( h_t \\) emits observation \\( y_t \\) according to some specified model.
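
A minimal R sketch of simulating such a chain, assuming Bernoulli emissions as in the running example used below; the function and argument names (`simulate_hmm`, `pi0`, `p_emit`) are illustrative, not from the lecture code.

```r
# Sketch: simulate a discrete HMM with Bernoulli emissions.
# `A` is the K x K transition matrix, `pi0` the initial distribution,
# `p_emit[k]` the emission probability in state k (illustrative names).
simulate_hmm <- function(N, A, pi0, p_emit) {
  K <- length(pi0)
  z <- integer(N)
  x <- integer(N)
  z[1] <- sample.int(K, 1, prob = pi0)
  x[1] <- rbinom(1, 1, p_emit[z[1]])
  for (n in 2:N) {
    z[n] <- sample.int(K, 1, prob = A[z[n - 1], ])
    x[n] <- rbinom(1, 1, p_emit[z[n]])
  }
  list(z = z, x = x)
}
```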

A different view

Notation change

  • To be consistent with Bishop pp. 610-625.
  • Observed data \\( X \\).
  • Hidden states \\( Z \\).
  • Hidden states transition matrix \\( A_{k \times k} \\).

Complete-data likelihood

$$\begin{align}p(X, Z \given \theta) \propto & p(x_1 \given z_1, \phi)p(z_1 \given \pi) \cdot \\ & \prod_{n=2}^N p(z_n \given z_{n-1}, A)p(x_n \given z_n, \phi)\end{align}$$
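
Once the hidden path is known, this complete-data likelihood is easy to evaluate. A hedged R sketch on the log scale, with a user-supplied emission density `demit` and illustrative argument names:

```r
# Sketch: complete-data log-likelihood, given hidden path `z`
# (integer states 1..K), observations `x`, and an emission density
# `demit(x, k, phi)` supplied by the user (illustrative interface).
complete_loglik <- function(x, z, pi0, A, demit, phi) {
  N <- length(x)
  ll <- log(pi0[z[1]]) + log(demit(x[1], z[1], phi))
  for (n in 2:N) {
    ll <- ll + log(A[z[n - 1], z[n]]) + log(demit(x[n], z[n], phi))
  }
  ll
}
```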

Running man HMM

Complete data likelihood

Let \\( N = 3 \\).

$$\begin{align} L(\theta \given X, Z ) \propto & \left(\rr{Bern}(x_1|p_l)\right)^{z_1}\left(\rr{Bern}(x_1|p_s)\right)^{1-z_1} \\ \prod_{i=2}^3 & \left(\rr{Bern}(x_i|p_l)\right)^{z_i}\left(\rr{Bern}(x_i|p_s)\right)^{1-z_i} \\ & \cdot T(z_i \given z_{i-1})\end{align}$$

Observed-data likelihood

$$p(X \given \theta) \propto \sum_{Z} p(X, Z \given \theta)$$

  • Need to sum over \\( K^N \\) terms.
  • Impractical to do this directly.

Observed data likelihood

\\( Z = \{ (0, 0, 0), (0, 0, 1), (0, 1, 0), \ldots, (1, 1, 1)\} \\) - \\( 2^3 = 8\\) terms

$$\begin{align} L \propto & \sum_{z \in Z} \left[\left(\rr{Bern}(x_1|p_l)\right)^{z_1}\left(\rr{Bern}(x_1|p_s)\right)^{1-z_1} \right.\\ \prod_{i=2}^3 & \left(\rr{Bern}(x_i|p_l)\right)^{z_i}\left(\rr{Bern}(x_i|p_s)\right)^{1-z_i} \\ & \left.\cdot T(z_i \given z_{i-1}) \right]\end{align}$$
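
For this small example the brute-force sum can be coded directly; a sketch assuming the coding above (binary \\( z = 1 \\) for lazy with emission probability \\( p_l \\), \\( z = 0 \\) for sporty with \\( p_s \\)), a flat initial distribution, and an illustrative transition-matrix argument `Tmat`:

```r
# Brute-force observed-data likelihood for N = 3, K = 2 (running example).
# z = 1 codes "lazy" (emission prob p_l), z = 0 codes "sporty" (p_s);
# `Tmat` is the 2 x 2 transition matrix indexed as Tmat[from, to] with
# rows/cols ordered (sporty, lazy).  Illustrative names.
brute_force_lik <- function(x, p_l, p_s, Tmat) {
  lik <- 0
  for (z1 in 0:1) for (z2 in 0:1) for (z3 in 0:1) {
    z <- c(z1, z2, z3)
    p_emit <- ifelse(z == 1, p_l, p_s)
    term <- 0.5 * prod(dbinom(x, 1, p_emit)) *
      Tmat[z1 + 1, z2 + 1] * Tmat[z2 + 1, z3 + 1]
    lik <- lik + term
  }
  lik
}
```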

Define

  • \\( \gamma (z_n) = p (z_n \given X, \theta^{(old)}) \\)
  • \\( \xi (z_{n-1}, z_n) = p (z_{n-1}, z_n \given X, \theta^{(old)}) \\)
  • Also let
    • \\( \gamma( z_{nk}) = P( z_{nk} = 1 \given X, \theta^{(old)}) \\) - conditional probability that \\( Z_n \\) is in state \\( k \\).
    • \\( \xi( z_{n-1,j}, z_{nk}) = \\) \\(P( z_{n-1,j} = 1, z_{nk} = 1 \given X, \theta^{(old)}) \\) - conditional probability that \\( Z_{n-1} \\) is in state \\( j \\) and \\( Z_n \\) is in state \\( k \\).

\\( \gamma \\) and \\( \xi \\) in the example

  • \\( \gamma (z_{n1}) = P(\rr{sporty on day }n \given X, \theta^{(old)}) \\)
  • \\( \xi (z_{n-1, 1}, z_{n2}) = P(\rr{sporty on day }n-1, \rr{ lazy on day }n \given X, \theta^{(old)}) \\)
  • \\( X \\) - observed running data.
  • \\( \theta^{(old)} \\) - the previous iteration's values of the transition matrix \\( A \\) and the emission parameters.

\\( Q \\) function

$$\begin{align}Q(\theta, \theta^{(old)}) = & \sum_{k=1}^K\gamma(z_{1k})\rr{log} \pi_k + \\ & \sum_{n=2}^N \sum_{j=1}^K \sum_{k=1}^K \xi(z_{n-1, j}, z_{nk}) \rr{log} A_{jk} \\ & + \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \rr{log} p( x_n \given \phi_k) \end{align}$$
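
Given the E-step quantities, \\( Q \\) is a few weighted sums. A hedged R sketch with illustrative names: `gamma` is an \\( N \times K \\) matrix, `xi` an \\( (N-1) \times K \times K \\) array, and `log_emis` an \\( N \times K \\) matrix of \\( \rr{log}\, p(x_n \given \phi_k) \\).

```r
# Sketch: evaluate Q(theta, theta_old) from the E-step quantities.
Q_fun <- function(gamma, xi, log_pi, log_A, log_emis) {
  # transition term: sum over n of sum_jk xi[n, j, k] * log A[j, k]
  trans <- sum(vapply(seq_len(dim(xi)[1]),
                      function(n) sum(xi[n, , ] * log_A), numeric(1)))
  sum(gamma[1, ] * log_pi) + trans + sum(gamma * log_emis)
}
```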

Example \\( Q \\) function

$$\begin{align}Q(\theta, \theta^{(old)}) = & \sum_{k=1}^2\gamma(z_{1k})\rr{log} 0.5 + \\ & \sum_{n=2}^3 \sum_{j=1}^2 \sum_{k=1}^2 \xi(z_{n-1, j}, z_{nk}) \rr{log} A_{jk} + \\ & \sum_{n=1}^3 \sum_{k=1}^2 \gamma(z_{nk}) \rr{log} \rr{Bern}(x_n|p_k) \end{align}$$

where \\( p_1 = p_s \\) (sporty) and \\( p_2 = p_l \\) (lazy).

M step

Constraint: \\( \pi \\) and rows of \\( A \\) need to sum to 1.

$$ \pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^K \gamma(z_{1j})} $$

$$ A_{jk} = \frac{\sum_{n=2}^N \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^K \sum_{n=2}^N \xi(z_{n-1, j}, z_{nl})}$$

  • This kind of weighting is common in EM algorithms: the expected counts \\( \gamma \\) and \\( \xi \\) act as likelihood weights.
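
A hedged R sketch of these two updates, assuming `gamma` and `xi` are stored as above (illustrative names):

```r
# Sketch of the M-step updates for pi and A from gamma and xi
# (gamma: N x K, xi: (N-1) x K x K).
m_step_pi_A <- function(gamma, xi) {
  pi_new <- gamma[1, ] / sum(gamma[1, ])
  A_num  <- apply(xi, c(2, 3), sum)     # sum over n = 2..N
  A_new  <- A_num / rowSums(A_num)      # normalize so each row sums to 1
  list(pi = pi_new, A = A_new)
}
```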

Example transitions M step

  • The initial distribution \\( \pi \\) is fixed and flat in the example, so there is nothing to update for \\( \pi \\).
  • The transition probabilities follow the same generic update: $$ A_{jk} = \frac{\sum_{n=2}^N \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^K \sum_{n=2}^N \xi(z_{n-1, j}, z_{nl})}$$

M step for emission params

$$ \mu_k = \frac{\sum_{n=1}^N\gamma(z_{nk}) x_n}{\sum_{n=1}^N \gamma(z_{nk})} $$

$$ \Sigma_k = \frac{\sum_{n=1}^N\gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N \gamma(z_{nk})} $$

  • This is for the Normal emission case.
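
A sketch of these Normal-emission updates in R, assuming `x` is an \\( N \times d \\) data matrix and `gamma` the \\( N \times K \\) responsibility matrix (illustrative names):

```r
# Sketch: M-step for Normal emissions.
m_step_normal <- function(x, gamma) {
  K <- ncol(gamma)
  w  <- colSums(gamma)                          # effective counts per state
  mu <- t(gamma) %*% x / w                      # K x d weighted means
  Sigma <- lapply(seq_len(K), function(k) {
    xc <- sweep(x, 2, mu[k, ])                  # center rows at mu_k
    t(xc) %*% (gamma[, k] * xc) / w[k]          # weighted covariance
  })
  list(mu = mu, Sigma = Sigma)
}
```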

Example emission M step

Sporty running probability:

$$ p_s = {\sum_{n=1}^3 \gamma(z_{n1}) x_n\over \sum_{n=1}^3 \gamma(z_{n1}) }$$

Lazy running probability:

$$ p_l = {\sum_{n=1}^3 \gamma(z_{n2}) x_n\over \sum_{n=1}^3 \gamma(z_{n2}) }$$
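
In R, with `gamma` the \\( 3 \times 2 \\) responsibility matrix and the convention that state 1 is sporty and state 2 is lazy (a sketch, illustrative names):

```r
# Bernoulli emission updates for the running example.
p_s <- sum(gamma[, 1] * x) / sum(gamma[, 1])   # state 1 = sporty
p_l <- sum(gamma[, 2] * x) / sum(gamma[, 2])   # state 2 = lazy
```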

Breather: EM and KL divergence

+ visualization critique

Conditional independence properties

$$\begin{align} p(X \given z_n) = & p(x_1, \ldots, x_n \given z_n) \cdot \\ & p(x_{n+1}, \ldots, x_N \given z_n) \\ p(z_{N+1} \given z_N, X) = & p(z_{N+1} \given z_N) \end{align}$$

  • Other conditional independence properties hold as well.

Note that

$$\gamma (z_n ) = p(z_n \given X) = { p(X \given z_n)p(z_n) \over p(X)}$$

All of this is conditional on \\( \theta^{\rr{old}} \\), so \\( p(X) \\) is the observed-data likelihood evaluated at \\( \theta^{\rr{old}} \\).

Then

$$\gamma(z_n) = {p(x_1, \ldots, x_n, z_n) p(x_{n+1}, \ldots, x_N \given z_n) \over p(X)}$$

Definitions

  • \\( \alpha ( z_n ) \equiv p(x_1, \ldots, x_n, z_n) \\)
  • \\( \beta (z_n) \equiv p(x_{n+1}, \ldots, x_N \given z_n)\\)

Example \\( \alpha \\) and \\( \beta \\)

  • \\( \alpha \\) is the probability of observing a given sequence of running behavior up to time \\( n \\) and a specific lazy/sporty state at time \\( n \\).
  • \\( \beta \\) is the probability of observing a future sequence of running behavior after time \\( n \\) given a specific lazy/sporty state at time \\( n \\).

Forward recursion

$$\begin{align}\alpha(z_n) = &p(x_n|z_n) \cdot \\ & \sum_{z_{n-1}} p(x_1, \ldots, x_{n-1}, z_{n-1}) p(z_n|z_{n-1})\end{align}$$

$$\alpha(z_n) = p(x_n \given z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) p(z_n \given z_{n-1})$$

Initial condition: $$\alpha(z_1) = p(x_1, z_1) = \prod_{k=1}^K \left\{ \pi_k p(x_1 \given \phi_k)\right\}^{z_{1k}}$$
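
The forward recursion maps naturally onto matrix operations. A hedged R sketch using an \\( N \times K \\) emission matrix `emis` with entries \\( p(x_n \given \phi_k) \\), keeping the \\( \alpha \\)'s unscaled (in practice each step would be rescaled to avoid underflow); names are illustrative.

```r
# Sketch of the forward pass.
forward_pass <- function(emis, A, pi0) {
  N <- nrow(emis); K <- ncol(emis)
  alpha <- matrix(0, N, K)
  alpha[1, ] <- pi0 * emis[1, ]                        # initial condition
  for (n in 2:N) {
    alpha[n, ] <- emis[n, ] * (alpha[n - 1, ] %*% A)   # vectorized over states
  }
  alpha
}
```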

Backward recursion

$$\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1}) p(x_{n+1} \given z_{n+1}) p(z_{n+1} \given z_n)$$

Take initial condition from

$$p(z_N \given X) = { p(X, z_N) \beta(z_N) \over p(X)}$$

So, set \\( \beta(z_N) = 1 \\) for all settings of \\( z_N \\).

R-friendly calculations

  • Remember that \\( \alpha \\) and \\( \beta \\) each have \\( K \\) components.
  • How can we perform a step of, say, the \\( \beta \\) recursion in an R-friendly way?
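
One answer: each step of the \\( \beta \\) recursion is a single matrix-vector product. A hedged sketch of the full backward pass under the same conventions as the forward pass above (illustrative names, unscaled \\( \beta \\)'s):

```r
# Sketch of the backward pass; the sum over z_{n+1} is a matrix-vector product.
backward_pass <- function(emis, A) {
  N <- nrow(emis); K <- ncol(emis)
  beta <- matrix(0, N, K)
  beta[N, ] <- 1                                           # initial condition
  for (n in (N - 1):1) {
    beta[n, ] <- A %*% (emis[n + 1, ] * beta[n + 1, ])     # vectorized over states
  }
  beta
}
```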

Example recursions

The algorithm is generic; only the emission model changes from one implementation to the next.

  • The \\( \beta(z_3) \\) initial condition is straightforward: a vector of two ones.
  • \\( \alpha(z_{11}) = 1/2\cdot p_s^{x_1} (1 - p_s)^{1-x_1}\\), \\( \alpha(z_{12}) = 1/2\cdot p_l^{x_1} (1 - p_l)^{1-x_1}\\)

Calculating \\( \xi \\)'s

$$\begin{align}\xi(z_{n-1}, & z_n) = \\ & \frac{\alpha(z_{n-1})p(x_n \given z_n) p(z_n \given z_{n-1}) \beta(z_n)}{p(X)}\end{align}$$
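
A sketch of the corresponding R computation, building the \\( \xi \\) array from `alpha`, `beta`, `A`, the emission matrix `emis`, and the likelihood `lik` (illustrative names):

```r
# Sketch: xi[n-1, j, k] from the forward/backward quantities.
compute_xi <- function(alpha, beta, A, emis, lik) {
  N <- nrow(alpha); K <- ncol(alpha)
  xi <- array(0, dim = c(N - 1, K, K))
  for (n in 2:N) {
    # outer product gives alpha(z_{n-1,j}) * p(x_n | z_{nk}) * beta(z_{nk})
    xi[n - 1, , ] <- (alpha[n - 1, ] %o% (emis[n, ] * beta[n, ])) * A / lik
  }
  xi
}
```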

Likelihood calculation

$$p(X) = \sum_{z_n} \alpha(z_n) \beta(z_n) = \sum_{z_N} \alpha(z_N)$$
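
In the same notation (a sketch, with `alpha` and `beta` as \\( N \times K \\) matrices):

```r
# Either expression recovers p(X); the first works for any n,
# the second uses beta(z_N) = 1.
lik <- sum(alpha[n, ] * beta[n, ])
lik <- sum(alpha[N, ])
```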

Example \\( \xi \\) and likelihood

  • The formulas are generic.
  • \\( \xi \\) and the likelihood can be obtained directly from the current transition and emission parameters together with the computed values of \\( \alpha \\) and \\( \beta \\).

Computing complexity

  • The sum over \\( z_{n-1} \\) has \\( K \\) terms, and it must be evaluated for each of the \\( K \\) values of \\( z_n \\).
  • So, each step of \\( \alpha \\) recursion scales like \\( O (K^2) \\).
  • Since we need to do this for each observation in the chain, the overall complexity is \\( O(K^2N) \\).
  • Compare to \\( K^N \\) brute force summing out.
  • Add to this the backwards recursion, likelihood calculation, bookkeeping etc.

Example computing complexity

  • \\( K = 2 \\), \\( N = 3 \\), so roughly \\( K^2 N = 12 \\) operations - more than the \\( 2^3 = 8 \\) terms of brute force!
  • But the exponential \\( K^N \\) quickly dominates as the number of observations grows.

So, does EM for HMMs converge?

  • There is no free lunch: the easier computation with EM comes at a price.
  • The convergence rate is governed by the fraction of missing information.
  • Intuitively, this depends on the number of hidden states.

More on HMMs

  • Predictive distribution.
  • Other algorithms.
  • Extensions of HMMs.
  • Bishop pp. 625-635

Properties of EM

  • Local maximization.
  • Sensitivity to starting points.
  • Will explore this in Pset 3.

Announcements

  • How is pset 3 going?

Resources

Final slide

  • Next lecture: ECM, EM and Bayesian approaches, EM + MCMC