Hacker level: Data Augmentation
Stat 221, Lecture 23
@snesterko
Roadmap
- Data Augmentation
- In point estimation
- In full MCMC
- Recent advances
- Main points
- DA + reparametrization - same or different?
- Computing/implementation implications
Where we are
Data Augmentation (DA)
- A paradox: make the system more complex in order to make
it simpler*. How could that be?
*Simpler for whom?
DA and EM
- Using EM when no data are actually missing.
- Using PX-EM - expanding the parameter space when there is missing
data.
Example: \\( t \\) distribution
$$f(x_i \given \theta) = \frac{\Gamma(\nu / 2 + 1/2)}{\sqrt{\pi\nu \sigma^2} \Gamma(\nu / 2) \left[1 + \frac{(x_i - \mu)^2}{\nu \sigma^2}\right]^{(\nu + 1) / 2}}$$
- Not an exponential family.
- Parameters \\( \theta = (\mu, \sigma) \\); the degrees of freedom \\( \nu \\) are known.
- Define augmented complete dataset \\( Y = ( Y_{\rr{obs}}, Y_{\rr{mis}} ) \\) with \\( Y_{\rr{obs}} = X \\) and \\( Y_{\rr{mis}} = W = (w_1, \ldots, w_n) \\)
Augmented data
$$x_i \given \theta, w_i \mathop{\sim}^{\rr{ind}} N( \mu, \sigma^2/w_i), \rr{ } w_i \given \theta \sim \chi^2_\nu / \nu$$
- Complete data is an exponential family with sufficient statistics
$$s_0 = \sum_{i=1}^n w_i, \rr{ } s_1 = \sum_{i=1}^n w_i x_i, \rr{ } s_2 = \sum_{i=1}^n w_i x_i^2$$
- Analytic E and M steps.
- Little and Rubin pp. 175-176.
E step
$$w_i^{(t)} = E(w_i \given x_i, \theta^{(t)}) = \frac{\nu + 1}{\nu + d_i^{(t)^2}}$$
- where \\( d_i^{(t)} = (x_i - \mu^{(t)}) / \sigma^{(t)} \\).
M step
$$\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\
\sigma^{(t+1)^2} & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2 \\
& = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}$$
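A minimal Python sketch of this EM iteration, assuming `x` holds the observed sample and `nu` the known degrees of freedom (the function name and starting values are illustrative):

```python
import numpy as np

def em_t(x, nu, n_iter=50):
    """EM for the univariate t model with known degrees of freedom nu
    (a minimal sketch; function and starting values are illustrative)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sigma = x.mean(), x.std()              # crude starting values
    for _ in range(n_iter):
        # E step: expected weights w_i given the current (mu, sigma)
        d = (x - mu) / sigma
        w = (nu + 1.0) / (nu + d**2)
        # M step: weighted sufficient statistics s0, s1, s2
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x**2).sum()
        mu = s1 / s0
        sigma = np.sqrt((s2 - s1**2 / s0) / n)
    return mu, sigma
```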
PX-EM
- Augment parameter space.
- Define new parameters \\( \phi = (\theta^*, \alpha) \\)
- Original model is obtained by setting \\( \alpha = \alpha_0 \\).
- When \\( \alpha = \alpha_0 \\), \\( \theta^* = \theta \\).
- Dimension of \\( \theta^* \\) is the same as the dimension of \\( \theta \\).
- \\( \theta = R( \theta^*, \alpha) \\) for known transformation \\( R \\).
- Convergence speeds up.
- Little and Rubin pp. 184-186.
Moreover
- There should be no information about \\( \alpha \\) in the observed
data \\( Y_\rr{obs} \\):
$$f_x (Y_\rr{obs} \given \theta^*, \alpha) = f_x (Y_\rr{obs} \given \theta^*, \alpha')
\rr{ } \forall \alpha, \alpha'$$
- Parameters \\( \phi \\) in the expanded model are identifiable from the complete
data \\( Y = ( Y_\rr{obs}, Y_\rr{mis} ) \\).
PX-EM algorithm
- PX-E step. Compute \\( Q(\phi \given \phi^{(t)}) = \rr{E} (\rr{log} f_x (Y \given \phi) \given Y_\rr{obs}, \phi^{(t)} )\\)
- PX-M step. Find \\( \phi^{(t+1)} = \rr{argmax}_\phi Q(\phi \given \phi^{(t)}) \\), and
then set \\( \theta^{(t+1)} = R( \theta^{*^{(t+1)}}, \alpha^{(t+1)}) \\).
Example: univariate \\( t \\) with known degrees of freedom
$$\begin{align}
x_i & \given \theta^*, \alpha, w_i \mathop{\sim}^{\rr{ind}} N( \mu_*, \sigma^2_*/w_i), \\
w_i & \given \theta^*, \alpha \sim \alpha \chi^2_\nu / \nu \end{align}$$
- Marginal density of \\( X = Y_\rr{obs} \\) does not change by changing \\( \alpha \\)
- \\( \alpha \\) can be identified from the joint distribution of \\( (X, W ) \\)
- \\( (\theta^*, \alpha) \rightarrow \theta: \\) \\( \mu = \mu_* \\), \\( \sigma = \sigma_* / \sqrt{\alpha} \\)
PX-E step
At iteration \\( (t + 1) \\), compute
$$w_i^{(t)} = E(w_i \given x_i, \phi^{(t)}) = \alpha^{(t)}\frac{\nu + 1}{\nu + d_i^{(t)^2}}$$
where
$$ d_i^{(t)} = \sqrt{\alpha^{(t)}}(x_i - \mu^{(t)}_*) / \sigma^{(t)}_* = (x_i - \mu^{(t)}) / \sigma^{(t)}$$
PX-M step
$$\begin{align}
\mu^{(t+1)}_* & = \frac{s_1^{(t)}}{s_0^{(t)}} \\
\sigma^{(t+1)^2}_* & = \frac{1}{n} \sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)}_*)^2 \\
& = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{n}\end{align}$$
Update the "extended" parameter \\( \alpha \\):
$$\alpha^{(t+1)} = {1 \over n} \sum_{i=1}^n w_i^{(t)}$$
In the original parameter space
$$\begin{align}\mu^{(t+1)} & = \frac{s_1^{(t)}}{s_0^{(t)}} \\
\sigma^{(t+1)^2} & = \frac{\sum_{i=1}^n w_i^{(t)} (x_i - \mu^{(t+1)})^2}{\sum_{i=1}^n w_i^{(t)}} = \frac{s_2^{(t)} - s_1^{(t)^2} / s_0^{(t)}}{s_0^{(t)}}\end{align}$$
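A minimal Python sketch of the PX-EM iteration after mapping back to \\( \theta \\), assuming the same setup as the plain EM sketch above; the only change is the denominator of the variance update:

```python
import numpy as np

def px_em_t(x, nu, n_iter=50):
    """PX-EM for the univariate t model (minimal sketch, illustrative names).
    Identical to em_t except that the variance update divides by the sum of
    weights s0 rather than by the sample size n."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    for _ in range(n_iter):
        # PX-E step, already expressed in the original parametrization
        d = (x - mu) / sigma
        w = (nu + 1.0) / (nu + d**2)
        # PX-M step followed by the reduction theta = R(theta*, alpha)
        s0, s1, s2 = w.sum(), (w * x).sum(), (w * x**2).sum()
        mu = s1 / s0
        sigma = np.sqrt((s2 - s1**2 / s0) / s0)   # divide by s0, not n
    return mu, sigma
```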
What are the differences?
What are the differences compared to the non-parameter expanded EM
from the previous example?
- Using the sum of weights in the denominator instead of the sample size \\( n \\)
- This speeds up convergence
- What is the price for better convergence properties?
DA and MCMC
- Hamiltonian Monte Carlo
- Simulated Tempering
- Parallel Tempering
- Equi-Energy Sampler
- Open research
- Reparametrization approaches
- Perfect Sampling
HMC structure
- We want to simulate \\( \theta \equiv q \\) from \\( p(\theta \given Y) \propto \pi(\theta) L(\theta \given Y)\\).
- Augment the \\( \theta \\) space - each element of \\( q \\) gets a corresponding momentum "brother" \\( p \\).
- Potential energy function (linked to the posterior) $$U(q) = -\rr{log} \left( p (q \given Y)\right)$$
- Kinetic energy function (negative log-density of momentum variables) \\( K(p) \\).
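A common concrete choice (an assumption here, not stated on the slide) is a Gaussian kinetic energy with mass matrix \\( M \\), often the identity:
$$K(p) = \frac{1}{2} p^\top M^{-1} p, \rr{ } p \sim N(0, M)$$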
Main intuition
In the Hamiltonian system, the sum of the kinetic energy \\( K(p) \\) and the
potential energy \\( U(q) \\) is constant.
HMC algorithm
Two steps:
- Draw new momentum variables \\( p^* \\) from the distribution
(usually Gaussian) defined by the kinetic energy function.
- Propose a new state \\( q^* \\) for the position variables \\( q \\) by
simulating Hamiltonian dynamics given the momentum \\( p^* \\) and
the current value of \\(q\\). Accept with probability
$$\rr{min}\left[ 1, \rr{exp}\left( -U(q^*) + U(q) - K(p^*) + K(p) \right)\right]$$
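A minimal Python sketch of one HMC transition using a leapfrog integrator and an identity mass matrix; the interface (`U`, `grad_U`, step size `eps`, path length `L`) is an illustrative assumption:

```python
import numpy as np

def hmc_step(q, U, grad_U, eps=0.1, L=20, rng=None):
    """One HMC transition (illustrative interface, identity mass matrix):
    U(q) = -log p(q | Y) is the potential energy and grad_U its gradient."""
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(q.shape)            # fresh Gaussian momentum p*
    q_new, p_new = q.copy(), p.copy()
    # approximate the Hamiltonian dynamics with L leapfrog steps
    p_new = p_new - 0.5 * eps * grad_U(q_new)
    for i in range(L):
        q_new = q_new + eps * p_new
        if i < L - 1:
            p_new = p_new - eps * grad_U(q_new)
    p_new = p_new - 0.5 * eps * grad_U(q_new)
    # Metropolis correction with the acceptance probability above
    log_alpha = -U(q_new) + U(q) - 0.5 * p_new @ p_new + 0.5 * p @ p
    return q_new if np.log(rng.uniform()) < log_alpha else q
```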
Simulated Tempering
- Data augmentation! Ideas borrowed from physics.
- Every density function can be rewritten in this way:
$$f(x) = e^{-(-\rr{log}f(x))}$$
- Define energy \\( E(x) = -\rr{log}f(x)\\)
- Aim to simulate from
$$f(x, T) = e^{- {1 \over T} E(x)}$$
- \\( T \in \{ T_1, T_2, \ldots, T_m\}\\) - pre-defined set of
temperature levels (presumably including temperature 1).
Properties
- Sample from the joint temperature-parameter space.
- Only use the draws from the parameter space with temperature 1.
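A minimal Python sketch of simulated tempering for a one-dimensional target, omitting the per-level pseudo-prior constants that a practical implementation has to tune (all names are illustrative):

```python
import numpy as np

def simulated_tempering(E, x0, temps, n_iter=10000, step=0.5):
    """Minimal simulated-tempering sketch for a 1-d target (illustrative).
    E(x) = -log f(x); temps is the ladder T_1, ..., T_m with temps[0] == 1.
    The per-level pseudo-prior constants that a practical sampler must tune
    are omitted for brevity."""
    rng = np.random.default_rng()
    x, k, keep = x0, 0, []
    for _ in range(n_iter):
        beta = 1.0 / temps[k]
        # Metropolis move in x at the current temperature
        x_prop = x + step * rng.standard_normal()
        if np.log(rng.uniform()) < beta * (E(x) - E(x_prop)):
            x = x_prop
        # propose a move to a neighboring temperature level
        k_prop = k + rng.choice([-1, 1])
        if 0 <= k_prop < len(temps):
            if np.log(rng.uniform()) < -E(x) * (1.0 / temps[k_prop] - beta):
                k = k_prop
        if k == 0:                     # only keep draws at temperature 1
            keep.append(x)
    return np.array(keep)
```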
Parallel Tempering
- Aims to use the ideas of simulated tempering.
- Amenable to (designed for) parallelization.
- Idea: run chains at different temperatures in parallel,
and "swap" parameter values between them once in a while.
- Goal: improve mixing at temperature 1.
Definitions
Define composite system
$$\Pi (X) = \prod_{i=1}^N {e^{- \beta_i E(x_i)} \over Z(T_i)}$$
with \\( \beta_i = {1 \over T_i }\\) and \\( Z(T_i) = \int e^{- \beta_i E(x_i)} dx_i\\).
- The component with \\( T_i = 1 \\) has the original target \\( f \\) as its marginal.
- At each iteration \\( t \\), update each replica \\( i \\) with a Markov chain
step targeting \\( e^{- \beta_i E(\cdot)} / Z(T_i) \\).
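A swap of the states of replicas \\( i \\) and \\( j \\) is itself a Metropolis move that leaves \\( \Pi \\) invariant, so (a standard consequence of the definition above, not written out on the slide) it is accepted with probability
$$\rr{min}\left[ 1, \rr{exp}\left\{ (\beta_i - \beta_j)\left( E(x_i) - E(x_j) \right) \right\}\right]$$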
The algorithm
initialize N temperatures T_1, ..., T_N
initialize N replicas x_1, ..., x_N
repeat {
  sample a new point x_i for each replica i
  with probability p_swap {
    sample an integer j uniformly from {1, ..., N - 1}
    attempt a swap between replicas j and j + 1
  }
}
- Swapping between replicas uses the augmented space to sample
the original (temperature 1) space better.
- What do we mean by "better"?
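A minimal Python sketch of the swap attempt, using the acceptance probability from the previous slide (all names are illustrative):

```python
import numpy as np

def attempt_swap(xs, energies, betas, j, rng=None):
    """Attempt a swap between replicas j and j + 1 (illustrative sketch).
    xs, energies, betas hold the current states, their energies E(x_i),
    and the inverse temperatures 1 / T_i for the N replicas."""
    rng = rng or np.random.default_rng()
    log_alpha = (betas[j] - betas[j + 1]) * (energies[j] - energies[j + 1])
    if np.log(rng.uniform()) < log_alpha:
        xs[j], xs[j + 1] = xs[j + 1], xs[j]
        energies[j], energies[j + 1] = energies[j + 1], energies[j]
```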
Equi-Energy Sampler
- Augment the space with a rich family of
parameters - almost copies of the original.
- The almost-copies have their own densities that are
derived from the original distribution.
Main idea of EE jumps
In addition to local moves, jump between states of (approximately) equal energy.
EE sampler: definitions
- The goal is to sample from \\( \pi(x) \\).
- \\( \pi(x) \\) is presumably multimodal.
- Let \\( h(x) \\) be the associated energy function:
$$\pi(x) \propto \rr{exp}(-h(x))$$
- Sequence of energy levels
$$H_0 < H_1 < H_2 < \ldots < H_K < H_{K+1} = \infty$$
- \\( H_0 \leq \rr{inf}_x h(x) \\)
- Associated sequence of temperatures
$$1 = T_0 < T_1 < \ldots < T_K$$
More definitions
- There are \\( K+1 \\) distributions \\( \pi_i \\), \\( 0 \leq i \leq K \\).
- Corresponding energy functions are
$$h_i(x) = { 1 \over T_i } \left( h(x) \vee H_i\right)$$
- That is,
$$\pi_i(x) \propto \rr{exp} (-h_i(x))$$
- This way \\( \pi_0 \\) is the target distribution \\( \pi \\).
- Higher \\( i \\) means more energy truncation and a higher temperature,
so it is easier to move between local modes.
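A minimal Python helper computing the truncated-and-tempered energies of the almost-copies, assuming `h_x = h(x)` has already been evaluated (names are illustrative):

```python
import numpy as np

def ee_energies(h_x, H, T):
    """Energies h_i(x) of the EE sampler's almost-copies (illustrative helper).
    h_x is h(x) at the current state, H is the ladder H_0 < ... < H_K,
    and T holds the matching temperatures with T[0] == 1."""
    return np.array([max(h_x, H_i) / T_i for H_i, T_i in zip(H, T)])
```

Each almost-copy then has density \\( \pi_i(x) \propto \rr{exp}(-h_i(x)) \\), as above.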
How perfect sampling works
DA + perfect sampling
- The idea is to augment continuous parameter models
with suitable discrete parameters and work to achieve
perfect sampling on the discrete parameters
- Then perfect sampling is also achieved on the continuous
parameters in the setting of a Gibbs sampler
Reparametrization strategies
- It turns out that reparametrizing the model
can help the chain reach stationarity faster.
Announcements
- Last T-shirt comp is ongoing!
- Final projects assistance: contact assigned TFs
Final slide
- Next lecture: Visualization - summing it up