Hacker level: Data Augmentation

Stat 221, Lecture 23

@snesterko

Roadmap

  • Data Augmentation
    • In point estimation
    • In full MCMC
    • Recent advances
  • Main points
    • DA + reparametrization - same or different?
    • Computing/implementation implications

Where we are

Data Augmentation (DA)

  • A paradox: make the system more complex in order to make it simpler*. How could that be?

*Simpler for whom?

DA and EM

  • Using EM even when no data are actually missing, by introducing augmented "missing" data.
  • Using PX-EM: expanding the parameter space when there is missing data.

Example: \\( t \\) distribution

$$f(x_i \given \theta) = \frac{\Gamma(\nu / 2 + 1/2)}{\sqrt{\pi\nu \sigma^2}\, \Gamma(\nu / 2) \left[1 + \frac{(x_i - \mu)^2}{\nu \sigma^2}\right]^{(\nu + 1) / 2}}$$

  • Not an exponential family.
  • Parameters \\( \theta = (\mu, \sigma) \\); the degrees of freedom \\( \nu \\) are assumed known.
  • Define augmented complete dataset \\( Y = ( Y_{\rr{obs}}, Y_{\rr{mis}} ) \\) with \\( Y_{\rr{obs}} = X \\) and \\( Y_{\rr{mis}} = W = (w_1, \ldots, w_n) \\)
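
As a quick numerical check of the density above (a sketch, assuming SciPy is available; the test point and parameter values are arbitrary), the formula should agree with the standard location-scale \\( t \\) density:

import math
from scipy import stats

def t_density(x, mu, sigma, nu):
    # t density with location mu, scale sigma, nu degrees of freedom,
    # written exactly as in the formula above
    num = math.gamma(nu / 2 + 1 / 2)
    den = (math.sqrt(math.pi * nu * sigma**2) * math.gamma(nu / 2)
           * (1 + (x - mu)**2 / (nu * sigma**2)) ** ((nu + 1) / 2))
    return num / den

x, mu, sigma, nu = 1.3, 0.5, 2.0, 5
print(t_density(x, mu, sigma, nu))                  # hand-coded formula
print(stats.t.pdf(x, df=nu, loc=mu, scale=sigma))   # same value from SciPy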

Augmented data

$$x_i \given \theta, w_i \mathop{\sim}^{\rr{ind}} N( \mu, \sigma^2/w_i), \rr{ } w_i \given \theta \sim \chi^2_\nu / \nu$$

  • The complete-data model is an exponential family with sufficient statistics

$$s_0 = \sum_{i=1}^n w_i, \rr{ } s_1 = \sum_{i=1}^n w_i x_i, \rr{ } s_2 = \sum_{i=1}^n w_i x_i^2$$

  • Analytic E and M steps.
  • Little and Rubin pp. 175-176.

E step

$$w_i^{(t)} = E(w_i \given x_i, \theta^{(t)}) = \frac{\nu + 1}{\nu + d_i^{(t)^2}}$$

  • where \\( d_i^{(t)} = (x_i - \mu^{(t)}) / \sigma^{(t)} \\).

M step

$$\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}$$
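
Putting the E and M steps together, here is a minimal sketch of the full EM iteration for this model (illustrative only: the starting values, iteration count, and function name are my own choices, and \\( \nu \\) is treated as known):

import numpy as np

def em_t(x, nu, n_iter=100):
    # EM for the t location-scale model; x is a 1-d NumPy array, nu is known
    mu, sigma2 = np.mean(x), np.var(x)          # starting values
    n = len(x)
    for _ in range(n_iter):
        # E step: expected weights given the current (mu, sigma2)
        d2 = (x - mu) ** 2 / sigma2
        w = (nu + 1) / (nu + d2)
        # M step: weighted sufficient statistics s0, s1, s2
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x ** 2).sum()
        mu = s1 / s0
        sigma2 = (s2 - s1 ** 2 / s0) / n        # plain EM divides by n
    return mu, np.sqrt(sigma2)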

PX-EM

  • Augment parameter space.
  • Define new parameters \\( \phi = (\theta^*, \alpha) \\)
  • Original model is obtained by setting \\( \alpha = \alpha_0 \\).
  • When \\( \alpha = \alpha_0 \\), \\( \theta^* = \theta \\).
  • Dimension of \\( \theta^* \\) is the same as the dimension of \\( \theta \\).
  • \\( \theta = R( \theta^*, \alpha) \\) for known transformation \\( R \\).
  • Convergence speeds up.
  • Little and Rubin pp. 184-186.

Moreover

  • There should be no information about \\( \alpha \\) in the observed data \\( Y_\rr{obs} \\): $$f_x (Y_\rr{obs} \given \theta^*, \alpha) = f_x (Y_\rr{obs} \given \theta^*, \alpha') \rr{ } \forall \alpha, \alpha'$$
  • Parameters \\( \phi \\) in the expanded model are identifiable from the complete data \\( Y = ( Y_\rr{obs}, Y_\rr{mis} ) \\).

PX-EM algorithm

  • PX-E step. Compute \\( Q(\phi \given \phi^{(t)}) = \rr{E} (\rr{log} f_x (Y \given \phi) \given Y_\rr{obs}, \phi^{(t)} )\\)
  • PX-M step. Find \\( \phi^{(t+1)} = \rr{argmax}_\phi Q(\phi \given \phi^{(t)}) \\), and then set \\( \theta^{(t+1)} = R( \theta^{*^{(t+1)}}, \alpha^{(t+1)}) \\).

Example: univariate \\( t \\) with known degrees of freedom

$$\begin{align} x_i & \given \theta^*, \alpha, w_i \mathop{\sim}^{\rr{ind}} N( \mu_*, \sigma^2_*/w_i), \\ w_i & \given \theta^*, \alpha \sim \alpha \chi^2_\nu / \nu \end{align}$$

  • The marginal density of \\( X = Y_\rr{obs} \\) does not depend on \\( \alpha \\)
  • \\( \alpha \\) can be identified from the joint distribution of \\( (X, W ) \\)
  • \\( (\theta^*, \alpha) \rightarrow \theta: \\) \\( \mu = \mu_* \\), \\( \sigma = \sigma_* / \sqrt{\alpha} \\)

PX-E step

At iteration \\( t \\), compute $$w_i^{(t)} = E(w_i \given x_i, \phi^{(t)}) = \alpha^{(t)}\frac{\nu + 1}{\nu + d_i^{(t)^2}}$$ where $$ d_i^{(t)} = \sqrt{\alpha^{(t)}}(x_i - \mu^{(t)}_*) / \sigma^{(t)}_* = (x_i - \mu^{(t)}) / \sigma^{(t)}$$

PX-M step

$$\begin{align} \mu^{(t+1)}_* & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2}_* & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)}_*)^2 \\ & = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}$$

Update the "extended" parameter \\( \alpha \\)

$$\alpha^{(t+1)} = {1 \over n} \sum_{i=1}^n w_i^{(t)} = \frac{s_0^{(t)}}{n}$$

In the original parameter space

$$\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\ \sigma^{(t+1)^2} & = \frac{\sigma_*^{(t+1)^2}}{\alpha^{(t+1)}} = \frac{\sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2}{\sum_{i=1}^n w_i^{(t)}} \end{align}$$
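
For comparison with the plain EM sketch earlier, here is the PX-EM iteration written directly in the original parametrization (again a sketch with \\( \nu \\) known and my own starting values); the only substantive change is the denominator in the variance update:

import numpy as np

def px_em_t(x, nu, n_iter=100):
    # PX-EM for the t model, tracked in the original parametrization; the
    # alpha^(t) factor in the PX-E weights cancels when reducing back via
    # theta = R(theta*, alpha), so it is not carried explicitly
    mu, sigma2 = np.mean(x), np.var(x)          # starting values
    for _ in range(n_iter):
        # PX-E step (d_i is the same in both parametrizations)
        d2 = (x - mu) ** 2 / sigma2
        w = (nu + 1) / (nu + d2)
        # PX-M step, then reduce back: sigma^2 = sigma_*^2 / alpha
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x ** 2).sum()
        mu = s1 / s0
        sigma2 = (s2 - s1 ** 2 / s0) / s0       # divide by sum of weights, not n
    return mu, np.sqrt(sigma2)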

What are the differences?

What are the differences compared to the non-parameter-expanded EM from the previous example?

  • Using the sum of weights in the denominator instead of the sample size \\( n \\)
  • This speeds up convergence
  • What is the price for better convergence properties?

DA and MCMC

  • Hamiltonian Monte Carlo
  • Simulated Tempering
  • Parallel Tempering
  • Equi-Energy Sampler
  • Open research
    • Reparametrization approaches
    • Perfect Sampling

HMC structure

  • We want to simulate \\( \theta \equiv q \\) from \\( p(\theta \given Y) \propto \pi(\theta) L(\theta \given Y)\\).
  • Augment the \\( \theta \\) space: each element of \\( q \\) gets a corresponding momentum "brother" \\( p \\).
  • Potential energy function (linked to the posterior) $$U(q) = -\rr{log} \left( p (q \given Y)\right)$$
  • Kinetic energy function (negative log-density of momentum variables) \\( K(p) \\).

Main intuition

In the Hamiltonian system, the sum of kinetic energy \\( K \\) and potential energy \\( U \\) is constant along the dynamics.

HMC: how it works

HMC algorithm

Two steps:

  1. Draw new momentum variables \\( p \\) from the distribution (usually Gaussian) defined by the kinetic energy function.
  2. Propose a new state \\( (q^*, p^*) \\) by simulating Hamiltonian dynamics starting from the current position \\( q \\) and the freshly drawn momentum \\( p \\). Accept with probability

$$\rr{min}\left[ 1, \rr{exp}\left( -U(q^*) + U(q) - K(p^*) + K(p) \right)\right]$$
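
A compact leapfrog-based sketch of these two steps (not taken from any particular package: the standard-normal momentum with \\( K(p) = p^\top p / 2 \\), the step size, and the number of leapfrog steps are illustrative choices, and U, grad_U are user-supplied):

import numpy as np

def hmc_step(q, U, grad_U, eps=0.1, n_leapfrog=20, rng=None):
    # q: current position (1-d NumPy array); U, grad_U: potential and its gradient
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: draw fresh Gaussian momentum, so K(p) = p.p / 2
    p = rng.standard_normal(q.shape)
    q_new, p_new = q.copy(), p.copy()
    # Step 2: simulate Hamiltonian dynamics with the leapfrog integrator
    p_new = p_new - 0.5 * eps * grad_U(q_new)
    for _ in range(n_leapfrog):
        q_new = q_new + eps * p_new
        p_new = p_new - eps * grad_U(q_new)
    p_new = p_new + 0.5 * eps * grad_U(q_new)   # roll back the extra half step
    # Metropolis accept/reject based on the change in total energy U + K
    log_accept = U(q) - U(q_new) + 0.5 * (p @ p - p_new @ p_new)
    return q_new if np.log(rng.uniform()) < log_accept else q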

Simulated Tempering

  • Data augmentation! Ideas borrowed from physics.
  • Every density function can be rewritten in this way: $$f(x) = e^{-(-\rr{log}f(x))}$$
  • Define energy \\( E(x) = -\rr{log}f(x)\\)
  • Aim to simulate from $$f(x, T) \propto e^{- {1 \over T} E(x)}$$
  • \\( T \in \{ T_1, T_2, \ldots, T_m\}\\) - a pre-defined set of temperature levels that includes temperature 1.

Properties

  • Sample from the joint temperature-parameter space.
  • Only use the draws from the parameter space with temperature 1.
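
As a sketch of how one simulated tempering sweep can look in code (hedged: the within-temperature move is a plain random-walk Metropolis update, the temperature proposal is a ±1 step, and the log pseudo-prior constants c[k] that balance time spent at each level are assumed to be supplied):

import numpy as np

def tempering_sweep(x, k, E, temps, c, step=0.5, rng=None):
    # One sweep on the augmented space: update x at temperature temps[k],
    # then propose moving the temperature index k up or down one level
    rng = np.random.default_rng() if rng is None else rng
    # (1) random-walk Metropolis update of x targeting exp(-E(x) / temps[k])
    x_prop = x + step * rng.standard_normal()
    if np.log(rng.uniform()) < (E(x) - E(x_prop)) / temps[k]:
        x = x_prop
    # (2) temperature move; c[k] are log pseudo-prior constants
    k_prop = k + rng.choice([-1, 1])
    if 0 <= k_prop < len(temps):
        log_r = E(x) / temps[k] - E(x) / temps[k_prop] + c[k_prop] - c[k]
        if np.log(rng.uniform()) < log_r:
            k = k_prop
    return x, k          # keep x for inference only when temps[k] == 1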

Parallel Tempering

  • Builds on the ideas of simulated tempering.
  • Amenable to (designed for) parallelization.
  • Idea: run chains at different temperatures in parallel, and "swap" parameter values between them once in a while.
  • Goal: improve mixing at temperature 1.

Definitions

Define composite system $$\Pi (X) = \prod_{i=1}^N {e^{- \beta_i E(x_i)} \over Z(T_i)}$$

with \\( \beta_i = {1 \over T_i }\\) and \\( Z(T_i) = \int e^{- \beta_i E(x_i)} dx_i\\).

  • The \\( i \\)-th factor of \\( \Pi \\) reduces to \\( f \\) when \\( T_i = 1 \\).
  • At each iteration \\( t \\), run a Markov chain update for each \\( i \\) targeting the tempered density \\( e^{-\beta_i E(\cdot)} / Z(T_i) \\).

The algorithm

initialize N temperatures T_1, ..., T_N
initialize N replicas x_1, ..., x_N
repeat {
  sample a new point x_i for each replica i (an MCMC update at temperature T_i)
  with probability rho {
    sample an integer j uniformly from {1, ..., N - 1}
    attempt a swap between replicas j and j + 1 (Metropolis accept/reject)
  }
}
  • Swapping between replicas uses the augmented space to sample from the original space better.
  • What do we mean by "better"?
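
For concreteness, a runnable version of the loop above (a sketch: random-walk Metropolis moves within each replica, a fixed swap-attempt probability, and the assumption that temps[0] = 1 so the first replica targets the original density):

import numpy as np

def parallel_tempering(E, temps, n_iter=5000, swap_prob=0.2, step=0.5, seed=0):
    # E: energy function; temps: increasing temperatures with temps[0] == 1
    rng = np.random.default_rng(seed)
    N = len(temps)
    x = rng.standard_normal(N)                  # initialize N replicas
    keep = []
    for _ in range(n_iter):
        # within-replica Metropolis update at each temperature
        for i in range(N):
            prop = x[i] + step * rng.standard_normal()
            if np.log(rng.uniform()) < (E(x[i]) - E(prop)) / temps[i]:
                x[i] = prop
        # occasional swap attempt between neighbouring replicas j and j + 1
        if rng.uniform() < swap_prob:
            j = rng.integers(0, N - 1)
            log_r = (1 / temps[j] - 1 / temps[j + 1]) * (E(x[j]) - E(x[j + 1]))
            if np.log(rng.uniform()) < log_r:
                x[j], x[j + 1] = x[j + 1], x[j]
        keep.append(x[0])                       # draws from the T = 1 replica
    return np.array(keep)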

Parallel Tempering

Equi-Energy Sampler

  • Augment the space with a rich family of auxiliary distributions - near-copies of the original.
  • The near-copies have their own densities, derived from the original distribution.

Main idea of EE jumps

In addition to local moves, the sampler also jumps between states of (roughly) equal energy.

EE sampler: definitions

  • The goal is to sample from \\( \pi(x) \\).
  • \\( \pi(x) \\) is presumably multimodal.
  • Let \\( h(x) \\) be the associated energy function: $$\pi(x) \propto \rr{exp}(-h(x))$$
  • Sequence of energy levels $$H_0 < H_1 < H_2 < \ldots < H_K < H_{K+1} = \infty$$
  • \\( H_0 \leq \rr{inf}_x h(x) \\)
  • Associated sequence of temperatures $$1 = T_0 < T_1 < \ldots < T_K$$

More definitions

  • There are \\( K+1 \\) distributions \\( \pi_i \\), \\( 0 \leq i \leq K \\).
  • Corresponding energy functions are $$h_i(x) = { 1 \over T_i } \left( h(x) \vee H_i\right)$$
  • That is, $$\pi_i(x) \propto \rr{exp} (-h_i(x))$$
  • This way \\( \pi_0 \\) is the target distribution \\( \pi \\).
  • Higher \\( i \\) means more energy truncation and a higher temperature, so it is easier to move between local modes.
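
The ladder of truncated, tempered energies is straightforward to compute; a minimal sketch (the function name and example values are mine, and the full EE sampler additionally needs the energy-ring bookkeeping and the equi-energy jumps described above):

def ee_energies(h_x, H, T):
    # h_i(x) = (h(x) v H_i) / T_i for the K+1 distributions pi_i
    # h_x: value of the original energy h at a point x
    # H:   energy levels H_0 < H_1 < ... < H_K with H_0 <= inf_x h(x)
    # T:   temperatures 1 = T_0 < T_1 < ... < T_K
    return [max(h_x, H_i) / T_i for H_i, T_i in zip(H, T)]

# e.g. a point with original energy h(x) = 3 on a four-level ladder
print(ee_energies(h_x=3.0, H=[0.0, 2.0, 4.0, 8.0], T=[1.0, 2.0, 4.0, 8.0]))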

Current research

How perfect sampling works

DA + perfect sampling

  • The idea is to augment continuous-parameter models with suitable discrete parameters and work to achieve perfect sampling on the discrete parameters.
  • Perfect sampling is then also achieved on the continuous parameters in the setting of a Gibbs sampler.

Reparametrization strategies

  • It turns out that reparametrizing the model can help the chain reach stationarity faster; see the sketch below.
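
A standard illustration (not from these slides) is the centered versus non-centered parametrization of a hierarchical normal model: the sampler works either on \\( \theta_j \sim N(\mu, \tau^2) \\) directly, or on standardized \\( \eta_j \sim N(0, 1) \\) with \\( \theta_j = \mu + \tau \eta_j \\). A minimal sketch of the two specifications (variable names mine):

import numpy as np

rng = np.random.default_rng(0)
mu, tau = 0.0, 0.1               # group-level mean and (small) scale

# Centered parametrization: theta_j are the sampled quantities; their scale is
# tied to tau, which tends to produce funnel-shaped posteriors that mix slowly.
theta_centered = rng.normal(mu, tau, size=8)

# Non-centered parametrization: sample standardized eta_j, then transform.
# A sampler working on (mu, tau, eta) often reaches stationarity faster.
eta = rng.standard_normal(8)
theta_noncentered = mu + tau * eta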

Why it works: idea

Announcements

  • Last T-shirt comp is ongoing!
  • Final projects assistance: contact assigned TFs

Resources

Final slide

  • Next lecture: Visualization - summing it up