 # Hacker level: Data Augmentation

### Stat 221, Lecture 23

@snesterko

• Data Augmentation
• In point estimation
• In full MCMC
• Main points
• DA + reparametrization - same or different?
• Computing/implementation implications

## Where we are ## Data Augmentation (DA)

• A paradox: make the system more complex in order to make it simpler*. How could that be?

*Simpler for who?

## DA and EM

• Using EM when there is no missing data.
• Using PX-EM - expanding parameter space when there is missing data.

## Example: \$$t \$$ distribution

$$f(x_i \given \theta) \sim \frac{\Gamma(\nu / 2 + 1/2)}{\sqrt{\pi\nu \sigma^2} \Gamma(\nu / 2) \left[1 + \frac{(x_i - \mu)^2}{\nu \sigma^2}\right]^{(\nu + 1) / 2}}$$

• Not an exponential family.
• Parameters \$$\theta = (\mu, \sigma) \$$.
• Define augmented complete dataset \$$Y = ( Y_{\rr{obs}}, Y_{\rr{mis}} ) \$$ with \$$Y_{\rr{obs}} = X \$$ and \$$Y_{\rr{mis}} = W = (w_1, \ldots, w_n) \$$

## Augmented data

$$x_i \given \theta, w_i \mathop{\sim}^{\rr{ind}} N( \mu, \sigma^2/w_i), \rr{ } w_i \given \theta \sim \chi^2_\nu / \nu$$

• Complete data is an exponential family with sufficient statistics

$$s_0 = \sum_{i=1}^n w_i, \rr{ } s_1 = \sum_{i=1}^n w_i x_i, \rr{ } s_2 = \sum_{i=1}^n w_i x_i^2$$

• Analytic E and M steps.
• Little and Rubin pp. 175-176.

## E step

$$w_i^{(t)} = E(w_i \given x_i, \theta) = \frac{\nu + 1}{\nu + d_i^{(t)^2}}$$

• where \$$d_i^{(t)} = (x_i - \mu^{(t)}) / \sigma^{(t)} \$$.

## M step

\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}

## PX-EM

• Augment parameter space.
• Define new parameters \$$\phi = (\theta^*, \alpha) \$$
• Original model is obtained by setting \$$\alpha = \alpha_0 \$$.
• When \$$\alpha = \alpha_0 \$$, \$$\theta^* = \theta \$$.
• Dimension of \$$\theta^* \$$ is the same as the dimension of \$$\theta \$$.
• \$$\theta = R( \theta^*, \alpha) \$$ for known transformation \$$R \$$.
• Convergence speeds up.
• Little and Rubin pp. 184-186.

## Moreover

• There should be no information about \$$\alpha \$$ in the observed data \$$Y_\rr{obs} \$$: $$f_x (Y_\rr{obs} \given \theta^*, \alpha) = f_x (Y_\rr{obs} \given \theta^*, \alpha') \rr{ } \forall \alpha, \alpha'$$
• Parameters \$$\phi \$$ in the expanded model are identifiable from the complete data \$$Y = ( Y_\rr{obs}, Y_\rr{mis} ) \$$.

## PX-EM algorithm

• PX-E step. Compute \$$Q(\phi \given \phi^{(t)}) = \rr{E} (\rr{log} f_x (Y \given \phi) \given Y_\rr{obs}, \phi^{(t)} )\$$
• PX-M step. Find \$$\phi^{(t+1)} = \rr{argmax}_\phi Q(\phi \given \phi^{(t)}) \$$, and then set \$$\theta^{(t+1)} = R( \theta^{*^{(t+1)}}, \alpha) \$$.

## Example: univariate \$$t \$$ with known degrees of freedom

\begin{align} x_i & \given \theta^*, \alpha, w_i \mathop{\sim}^{\rr{ind}} N( \mu_*, \sigma^2_*/w_i), \\ w_i & \given \theta^*, \alpha \sim \alpha \chi^2_\nu / \nu \end{align}

• Marginal density of \$$X = Y_\rr{obs} \$$ does not change by changing \$$\alpha \$$
• \$$\alpha \$$ can be identified from the joint distribution of \$$(X, W ) \$$
• \$$(\theta^*, \alpha) \rightarrow \theta: \$$ \$$\mu = \mu_* \$$, \$$\sigma = \sigma_* / \sqrt{\alpha} \$$

## PX-E step

At iteration \$$(t + 1) \$$, compute $$w_i^{(t)} = E(w_i \given x_i, \phi^{(t)}) = \alpha^{(t)}\frac{\nu + 1}{\nu + d_i^{(t)^2}}$$ where $$d_i^{(t)} = \sqrt{\alpha^{(t)}}(x_i - \mu^{(t)}_*) / \sigma^{(t)}_* = (x_i - \mu^{(t)}) / \sigma^{(t)}$$

## PX-M step

\begin{align} \mu^{(t+1)}_* & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2}_* & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)}_*)^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}

## Ipdate the "extended" parameter \$$\alpha \$$

$$\alpha^{(t+1)} = {1 \over n} \sum_{i=1}^n w_i^{(t+1)}$$

## In the original parameter space

\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)}_*)^2 / \sum_{i=1}^n w_i^{(t+1)} \end{align}

## What are the differences?

What are the differences compared to the non-parameter expanded EM from the previous example?

• Using the sum of weights in the denominator instead of sample size
• This speeds up convergence
• What is the price for better convergence properties?

## DA and MCMC

• Hamiltonian Monte Carlo
• Simulated Tempering
• Parallel Tempering
• Equi-Energy Sampler
• Open research
• Reparametrization approaches
• Perfect Sampling

## HMC structure

• We want to simulate \$$\theta \equiv q \$$ from \$$p(\theta \given Y) = \pi(\theta) L(\theta \given Y)\$$.
• Augment \$$\theta \$$ space - each element of \$$q \$$ gets a corresponding "brother" \$$p \$$.
• Potential energy function (linked to the posterior) $$U(q) = -\rr{log} \left( p (q \given Y)\right)$$
• Kinetic energy function (negative log-density of momentum variables) \$$K(p) \$$.

## Main intuition

In the Hamiltonian system, the sum of kinetic enery \$$P \$$ and potential energy \$$U \$$ is constant.

### HMC: how it works ## HMC algorithm

Two steps:

1. Draw new momentum variables \$$p^* \$$ from the distribution (usually Gauggian) defined by the kinetic energy function.
2. Propose a new state \$$q^* \$$ for position variables \$$q \$$ by simulating Hamiltonian dynamics given the momentum \$$p \$$ and the current value of \$$q\$$. Accept with probability

$$\rr{min}\left[ 1, \rr{exp}\left( -U(q^*) + U(q) - K(p^*) + K(p) \right)\right]$$

## Simulated Tempering

• Data augmentation! Ideas borrowed from physics.
• Every density function can be rewritten in this way: $$f(x) = e^{-(-\rr{log}f(x))}$$
• Define energy \$$E(x) = -\rr{log}f(x)\$$
• Aim to simulate from $$f(x, T) = e^{- {1 \over T} E(x)}$$
• \$$T \in \{ T_1, T_2, \ldots, T_m\}\$$ - pre-defined set of temperature levels (presumably including temperature 1).

## Properties

• Sample from the joint temperature-parameter space.
• Only use the draws from the parameter space with temperature 1.

## Parallel Tempering

• Aims to use the ideas of simulated tempering.
• Amenable to (designed for) parallelization.
• Idea: run chains at different temperatures in parallel, and "swap" parameter values between them once in a while.
• Goal: improve mixing at temperature 1.

## Definitions

Define composite system $$\Pi (X) = \prod_{i=1}^N {e^{- \beta_i E(x_i)} \over Z(T_i)}$$

with \$$\beta_i = {1 \over T_i }\$$ and \$$Z(T_i) = \int e^{- \beta_i E(x_i)} dx_i\$$.

• \$$\Pi \$$ reduces to \$$f \$$ if \$$T_i = 1 \$$.
• At each iteration \$$t \$$, construct a Markov Chain to sample from \$$E(\cdot) \$$ at \$$T_i\$$ for each \$$i \$$.

## The algorithm

initialize N temperatures x_1, ..., x_N
initialize N replicas x_1, ..., x_N
repeat {
sample a new point x_i for each replica i
with probability theta {
sample an integer j from U[1, N-1]
attempt a swap between replicas j and j + 1
}
}
• Swapping between replicas is using the augmented space to sample from original space better.
• What do we mean by "better"?

### Parallel Tempering ## Equi-Energy Sampler

• Augment the space with a rich family of parameters - almost copies of the original.
• The almost-copies have their own densities that are derived from the original distribution.

### Main idea of EE jumps Also move in constant energy space.

## EE sampler: definitions

• The goal is to sample from \$$\pi(x) \$$.
• \$$\pi(x) \$$ is presumably multimodal.
• Let \$$h(x) \$$ be the associated energy function: $$\pi(x) \propto \rr{exp}(-h(x))$$
• Sequence of energy levels $$H_0 < H_1 < H_2 < \ldots < H_K < H_{K+1} = \infty$$
• \$$H_0 \leq \rr{inf}_x h(x) \$$
• Associated sequence of temperatures $$1 = T_0 < T_1 < \ldots < T_K$$

## More definitions

• There are \$$K+1 \$$ distributions \$$\pi_i \$$, \$$0 \leq i \leq K \$$.
• Corresponding energy functions are $$h_i(x) = { 1 \over T } \left( h(x) \vee H_i\right)$$
• That is, $$\pi_i(x) \propto \rr{exp} (-h_i(x))$$
• This way \$$\pi_0 \$$ is the needed distribution \$$\pi \$$.
• High \$$i \$$ mean energy truncation and high temperatures, so it's easier to move between local modes.

## DA + perfect sampling

• The idea is to augment continuous parameter models with suitable discrete parameters and work to achieve perfect sampling on the discrete parameters
• Then perfect sampling is also achieved on the continuous parameters in the setting of a Gibbs sampler

## Reparametrization strategies

• Turns out, reparametrizing the model can help achieve stationarity faster.

## Announcements

• Last T-shirt comp is ongoing!
• Final projects assistance: contact assigned TFs

## Final slide

• Next lecture: Visualization - summing it up