Hacker level: Data Augmentation

Stat 221, Lecture 23

@snesterko

Roadmap

  • Data Augmentation
    • In point estimation
    • In full MCMC
    • Recent advances
  • Main points
    • DA + reparametrization - same or different?
    • Computing/implementation implications

Where we are

Data Augmentation (DA)

  • A paradox: make the system more complex in order to make it simpler*. How could that be?

*Simpler for whom?

DA and EM

  • Using EM even when no data are actually missing, by introducing augmented "missing" data.
  • Using PX-EM: expanding the parameter space when there is missing data.

Example: \\( t \\) distribution

$$f(x_i \given \theta) = \frac{\Gamma(\nu / 2 + 1/2)}{\sqrt{\pi\nu \sigma^2}\, \Gamma(\nu / 2) \left[1 + \frac{(x_i - \mu)^2}{\nu \sigma^2}\right]^{(\nu + 1) / 2}}$$

  • Not an exponential family.
  • Parameters \\( \theta = (\mu, \sigma) \\); the degrees of freedom \\( \nu \\) are assumed known.
  • Define augmented complete dataset \\( Y = ( Y_{\rr{obs}}, Y_{\rr{mis}} ) \\) with \\( Y_{\rr{obs}} = X \\) and \\( Y_{\rr{mis}} = W = (w_1, \ldots, w_n) \\)
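
As a quick numerical check of the density above (a sketch, assuming SciPy is available; the test point and parameter values are arbitrary), the formula should agree with the standard location-scale \\( t \\) density:

import math
from scipy import stats

def t_density(x, mu, sigma, nu):
    # t density with location mu, scale sigma, nu degrees of freedom,
    # written exactly as in the formula above
    num = math.gamma(nu / 2 + 1 / 2)
    den = (math.sqrt(math.pi * nu * sigma**2) * math.gamma(nu / 2)
           * (1 + (x - mu)**2 / (nu * sigma**2)) ** ((nu + 1) / 2))
    return num / den

x, mu, sigma, nu = 1.3, 0.5, 2.0, 5
print(t_density(x, mu, sigma, nu))                  # hand-coded formula
print(stats.t.pdf(x, df=nu, loc=mu, scale=sigma))   # same value from SciPy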

Augmented data

$$x_i \given \theta, w_i \mathop{\sim}^{\rr{ind}} N( \mu, \sigma^2/w_i), \rr{ } w_i \given \theta \sim \chi^2_\nu / \nu$$

  • The complete-data model is an exponential family with sufficient statistics

$$s_0 = \sum_{i=1}^n w_i, \rr{ } s_1 = \sum_{i=1}^n w_i x_i, \rr{ } s_2 = \sum_{i=1}^n w_i x_i^2$$

  • Analytic E and M steps.
  • Little and Rubin pp. 175-176.

E step

$$w_i^{(t)} = E(w_i \given x_i, \theta^{(t)}) = \frac{\nu + 1}{\nu + d_i^{(t)^2}}$$

  • where \\( d_i^{(t)} = (x_i - \mu^{(t)}) / \sigma^{(t)} \\).

M step

$$\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}$$
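
Putting the E and M steps together, here is a minimal sketch of the full EM iteration for this model (illustrative only: the starting values, iteration count, and function name are my own choices, and \\( \nu \\) is treated as known):

import numpy as np

def em_t(x, nu, n_iter=100):
    # EM for the t location-scale model; x is a 1-d NumPy array, nu is known
    mu, sigma2 = np.mean(x), np.var(x)          # starting values
    n = len(x)
    for _ in range(n_iter):
        # E step: expected weights given the current (mu, sigma2)
        d2 = (x - mu) ** 2 / sigma2
        w = (nu + 1) / (nu + d2)
        # M step: weighted sufficient statistics s0, s1, s2
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x ** 2).sum()
        mu = s1 / s0
        sigma2 = (s2 - s1 ** 2 / s0) / n        # plain EM divides by n
    return mu, np.sqrt(sigma2)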

PX-EM

  • Augment parameter space.
  • Define new parameters \\( \phi = (\theta^*, \alpha) \\)
  • Original model is obtained by setting \\( \alpha = \alpha_0 \\).
  • When \\( \alpha = \alpha_0 \\), \\( \theta^* = \theta \\).
  • Dimension of \\( \theta^* \\) is the same as the dimension of \\( \theta \\).
  • \\( \theta = R( \theta^*, \alpha) \\) for known transformation \\( R \\).
  • Convergence speeds up.
  • Little and Rubin pp. 184-186.

Moreover

  • There should be no information about \\( \alpha \\) in the observed data \\( Y_\rr{obs} \\): $$f_x (Y_\rr{obs} \given \theta^*, \alpha) = f_x (Y_\rr{obs} \given \theta^*, \alpha') \rr{ } \forall \alpha, \alpha'$$
  • Parameters \\( \phi \\) in the expanded model are identifiable from the complete data \\( Y = ( Y_\rr{obs}, Y_\rr{mis} ) \\).

PX-EM algorithm

  • PX-E step. Compute \\( Q(\phi \given \phi^{(t)}) = \rr{E} (\rr{log} f_x (Y \given \phi) \given Y_\rr{obs}, \phi^{(t)} )\\)
  • PX-M step. Find \\( \phi^{(t+1)} = \rr{argmax}_\phi Q(\phi \given \phi^{(t)}) \\), and then set \\( \theta^{(t+1)} = R( \theta^{*^{(t+1)}}, \alpha^{(t+1)}) \\).

Example: univariate \\( t \\) with known degrees of freedom

$$\begin{align} x_i & \given \theta^*, \alpha, w_i \mathop{\sim}^{\rr{ind}} N( \mu_*, \sigma^2_*/w_i), \\ w_i & \given \theta^*, \alpha \sim \alpha \chi^2_\nu / \nu \end{align}$$

  • The marginal density of \\( X = Y_\rr{obs} \\) does not depend on \\( \alpha \\)
  • \\( \alpha \\) can be identified from the joint distribution of \\( (X, W ) \\)
  • \\( (\theta^*, \alpha) \rightarrow \theta: \\) \\( \mu = \mu_* \\), \\( \sigma = \sigma_* / \sqrt{\alpha} \\)

PX-E step

At iteration \\( t \\), compute $$w_i^{(t)} = E(w_i \given x_i, \phi^{(t)}) = \alpha^{(t)}\frac{\nu + 1}{\nu + d_i^{(t)^2}}$$ where $$ d_i^{(t)} = \sqrt{\alpha^{(t)}}(x_i - \mu^{(t)}_*) / \sigma^{(t)}_* = (x_i - \mu^{(t)}) / \sigma^{(t)}$$

PX-M step

$$\begin{align} \mu^{(t+1)}_* & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2}_* & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)}_*)^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}$$

Update the "extended" parameter \\( \alpha \\)

$$\alpha^{(t+1)} = {1 \over n} \sum_{i=1}^n w_i^{(t)} = \frac{s_0^{(t)}}{n}$$

In the original parameter space

$$\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{\sigma_*^{(t+1)^2}}{\alpha^{(t+1)}} = \frac{\sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2}{\sum_{i=1}^n w_i^{(t)}} \end{align}$$
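
For comparison with the plain EM sketch earlier, here is the PX-EM iteration written directly in the original parametrization (again a sketch with \\( \nu \\) known and my own starting values); the only substantive change is the denominator in the variance update:

import numpy as np

def px_em_t(x, nu, n_iter=100):
    # PX-EM for the t model, tracked in the original parametrization; the
    # alpha^(t) factor in the PX-E weights cancels when reducing back via
    # theta = R(theta*, alpha), so it is not carried explicitly
    mu, sigma2 = np.mean(x), np.var(x)          # starting values
    for _ in range(n_iter):
        # PX-E step (d_i is the same in both parametrizations)
        d2 = (x - mu) ** 2 / sigma2
        w = (nu + 1) / (nu + d2)
        # PX-M step, then reduce back: sigma^2 = sigma_*^2 / alpha
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x ** 2).sum()
        mu = s1 / s0
        sigma2 = (s2 - s1 ** 2 / s0) / s0       # divide by sum of weights, not n
    return mu, np.sqrt(sigma2)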

What are the differences?

What are the differences compared to the non-parameter-expanded EM from the previous example?

  • Using the sum of weights in the denominator instead of the sample size \\( n \\)
  • This speeds up convergence
  • What is the price for better convergence properties?

DA and MCMC

  • Hamiltonian Monte Carlo
  • Simulated Tempering
  • Parallel Tempering
  • Equi-Energy Sampler
  • Open research
    • Reparametrization approaches
    • Perfect Sampling

HMC structure

  • We want to simulate \\( \theta \equiv q \\) from \\( p(\theta \given Y) \propto \pi(\theta) L(\theta \given Y)\\).
  • Augment the \\( \theta \\) space: each element of \\( q \\) gets a corresponding momentum "brother" \\( p \\).
  • Potential energy function (linked to the posterior) $$U(q) = -\rr{log} \left( p (q \given Y)\right)$$
  • Kinetic energy function (negative log-density of momentum variables) \\( K(p) \\).

Main intuition

In the Hamiltonian system, the sum of kinetic energy \\( K \\) and potential energy \\( U \\) is constant along the dynamics.

HMC: how it works

HMC algorithm

Two steps:

  1. Draw new momentum variables \\( p \\) from the distribution (usually Gaussian) defined by the kinetic energy function.
  2. Propose a new state \\( (q^*, p^*) \\) by simulating Hamiltonian dynamics starting from the current position \\( q \\) and the freshly drawn momentum \\( p \\). Accept with probability

$$\rr{min}\left[ 1, \rr{exp}\left( -U(q^*) + U(q) - K(p^*) + K(p) \right)\right]$$
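
A compact leapfrog-based sketch of these two steps (not taken from any particular package: the standard-normal momentum with \\( K(p) = p^\top p / 2 \\), the step size, and the number of leapfrog steps are illustrative choices, and U, grad_U are user-supplied):

import numpy as np

def hmc_step(q, U, grad_U, eps=0.1, n_leapfrog=20, rng=None):
    # q: current position (1-d NumPy array); U, grad_U: potential and its gradient
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: draw fresh Gaussian momentum, so K(p) = p.p / 2
    p = rng.standard_normal(q.shape)
    q_new, p_new = q.copy(), p.copy()
    # Step 2: simulate Hamiltonian dynamics with the leapfrog integrator
    p_new = p_new - 0.5 * eps * grad_U(q_new)
    for _ in range(n_leapfrog):
        q_new = q_new + eps * p_new
        p_new = p_new - eps * grad_U(q_new)
    p_new = p_new + 0.5 * eps * grad_U(q_new)   # roll back the extra half step
    # Metropolis accept/reject based on the change in total energy U + K
    log_accept = U(q) - U(q_new) + 0.5 * (p @ p - p_new @ p_new)
    return q_new if np.log(rng.uniform()) < log_accept else q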

Simulated Tempering

  • Data augmentation! Ideas borrowed from physics.
  • Every density function can be rewritten in this way: $$f(x) = e^{-(-\rr{log}f(x))}$$
  • Define energy \\( E(x) = -\rr{log}f(x)\\)
  • Aim to simulate from $$f(x, T) \propto e^{- {1 \over T} E(x)}$$
  • \\( T \in \{ T_1, T_2, \ldots, T_m\}\\) - a pre-defined set of temperature levels that includes temperature 1.

Properties

  • Sample from the joint temperature-parameter space.
  • Only use the draws from the parameter space with temperature 1.
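
As a sketch of how one simulated tempering sweep can look in code (hedged: the within-temperature move is a plain random-walk Metropolis update, the temperature proposal is a ±1 step, and the log pseudo-prior constants c[k] that balance time spent at each level are assumed to be supplied):

import numpy as np

def tempering_sweep(x, k, E, temps, c, step=0.5, rng=None):
    # One sweep on the augmented space: update x at temperature temps[k],
    # then propose moving the temperature index k up or down one level
    rng = np.random.default_rng() if rng is None else rng
    # (1) random-walk Metropolis update of x targeting exp(-E(x) / temps[k])
    x_prop = x + step * rng.standard_normal()
    if np.log(rng.uniform()) < (E(x) - E(x_prop)) / temps[k]:
        x = x_prop
    # (2) temperature move; c[k] are log pseudo-prior constants
    k_prop = k + rng.choice([-1, 1])
    if 0 <= k_prop < len(temps):
        log_r = E(x) / temps[k] - E(x) / temps[k_prop] + c[k_prop] - c[k]
        if np.log(rng.uniform()) < log_r:
            k = k_prop
    return x, k          # keep x for inference only when temps[k] == 1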

Parallel Tempering

  • Builds on the ideas of simulated tempering.
  • Amenable to (designed for) parallelization.
  • Idea: run chains at different temperatures in parallel, and "swap" parameter values between them once in a while.
  • Goal: improve mixing at temperature 1.

Definitions

Define composite system $$\Pi (X) = \prod_{i=1}^N {e^{- \beta_i E(x_i)} \over Z(T_i)}$$

with \\( \beta_i = {1 \over T_i }\\) and \\( Z(T_i) = \int e^{- \beta_i E(x_i)} dx_i\\).

  • The \\( i \\)-th factor of \\( \Pi \\) reduces to \\( f \\) when \\( T_i = 1 \\).
  • At each iteration \\( t \\), run a Markov chain update for each \\( i \\) targeting the tempered density \\( e^{-\beta_i E(\cdot)} / Z(T_i) \\).

The algorithm

initialize N temperatures T_1, ..., T_N
initialize N replicas x_1, ..., x_N
repeat {
  sample a new point x_i for each replica i (an MCMC update at temperature T_i)
  with probability rho {
    sample an integer j uniformly from {1, ..., N - 1}
    attempt a swap between replicas j and j + 1 (Metropolis accept/reject)
  }
}
  • Swapping between replicas uses the augmented space to sample from the original space better.
  • What do we mean by "better"?
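
For concreteness, a runnable version of the loop above (a sketch: random-walk Metropolis moves within each replica, a fixed swap-attempt probability, and the assumption that temps[0] = 1 so the first replica targets the original density):

import numpy as np

def parallel_tempering(E, temps, n_iter=5000, swap_prob=0.2, step=0.5, seed=0):
    # E: energy function; temps: increasing temperatures with temps[0] == 1
    rng = np.random.default_rng(seed)
    N = len(temps)
    x = rng.standard_normal(N)                  # initialize N replicas
    keep = []
    for _ in range(n_iter):
        # within-replica Metropolis update at each temperature
        for i in range(N):
            prop = x[i] + step * rng.standard_normal()
            if np.log(rng.uniform()) < (E(x[i]) - E(prop)) / temps[i]:
                x[i] = prop
        # occasional swap attempt between neighbouring replicas j and j + 1
        if rng.uniform() < swap_prob:
            j = rng.integers(0, N - 1)
            log_r = (1 / temps[j] - 1 / temps[j + 1]) * (E(x[j]) - E(x[j + 1]))
            if np.log(rng.uniform()) < log_r:
                x[j], x[j + 1] = x[j + 1], x[j]
        keep.append(x[0])                       # draws from the T = 1 replica
    return np.array(keep)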

Parallel Tempering

Equi-Energy Sampler

  • Augment the space with a rich family of auxiliary distributions - near-copies of the original.
  • The near-copies have their own densities, derived from the original distribution.

Main idea of EE jumps

In addition to local moves, the sampler also jumps between states of (roughly) equal energy.

EE sampler: definitions

  • The goal is to sample from \\( \pi(x) \\).
  • \\( \pi(x) \\) is presumably multimodal.
  • Let \\( h(x) \\) be the associated energy function: $$\pi(x) \propto \rr{exp}(-h(x))$$
  • Sequence of energy levels $$H_0 < H_1 < H_2 < \ldots < H_K < H_{K+1} = \infty$$
  • \\( H_0 \leq \rr{inf}_x h(x) \\)
  • Associated sequence of temperatures $$1 = T_0 < T_1 < \ldots < T_K$$

More definitions

  • There are \\( K+1 \\) distributions \\( \pi_i \\), \\( 0 \leq i \leq K \\).
  • Corresponding energy functions are $$h_i(x) = { 1 \over T_i } \left( h(x) \vee H_i\right)$$
  • That is, $$\pi_i(x) \propto \rr{exp} (-h_i(x))$$
  • This way \\( \pi_0 \\) is the target distribution \\( \pi \\).
  • Higher \\( i \\) means more energy truncation and a higher temperature, so it is easier to move between local modes.
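
The ladder of truncated, tempered energies is straightforward to compute; a minimal sketch (the function name and example values are mine, and the full EE sampler additionally needs the energy-ring bookkeeping and the equi-energy jumps described above):

def ee_energies(h_x, H, T):
    # h_i(x) = (h(x) v H_i) / T_i for the K+1 distributions pi_i
    # h_x: value of the original energy h at a point x
    # H:   energy levels H_0 < H_1 < ... < H_K with H_0 <= inf_x h(x)
    # T:   temperatures 1 = T_0 < T_1 < ... < T_K
    return [max(h_x, H_i) / T_i for H_i, T_i in zip(H, T)]

# e.g. a point with original energy h(x) = 3 on a four-level ladder
print(ee_energies(h_x=3.0, H=[0.0, 2.0, 4.0, 8.0], T=[1.0, 2.0, 4.0, 8.0]))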

Current research

How perfect sampling works

DA + perfect sampling

  • The idea is to augment continuous-parameter models with suitable discrete parameters and work to achieve perfect sampling on the discrete parameters.
  • Perfect sampling is then also achieved on the continuous parameters in the setting of a Gibbs sampler.

Reparametrization strategies

  • It turns out that reparametrizing the model can help the chain reach stationarity faster; see the sketch below.
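
A standard illustration (not from these slides) is the centered versus non-centered parametrization of a hierarchical normal model: the sampler works either on \\( \theta_j \sim N(\mu, \tau^2) \\) directly, or on standardized \\( \eta_j \sim N(0, 1) \\) with \\( \theta_j = \mu + \tau \eta_j \\). A minimal sketch of the two specifications (variable names mine):

import numpy as np

rng = np.random.default_rng(0)
mu, tau = 0.0, 0.1               # group-level mean and (small) scale

# Centered parametrization: theta_j are the sampled quantities; their scale is
# tied to tau, which tends to produce funnel-shaped posteriors that mix slowly.
theta_centered = rng.normal(mu, tau, size=8)

# Non-centered parametrization: sample standardized eta_j, then transform.
# A sampler working on (mu, tau, eta) often reaches stationarity faster.
eta = rng.standard_normal(8)
theta_noncentered = mu + tau * eta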

Why it works: idea

Announcements

  • Last T-shirt comp is ongoing!
  • Final projects assistance: contact assigned TFs

Resources

Final slide

  • Next lecture: Visualization - summing it up