Likelihood + Prior = Posterior (Bayesian inference)

Stat 221, Lecture 13

@snesterko

Roadmap

  • Recap what we've learned about EM.
  • Introduce the Bayesian paradigm.
    • Strength of prior
  • The Metropolis Algorithm
  • Gibbs sampling
  • Peek at HMC

Recap EM

  • A half-Bayesian algorithm
  • MLE with missing data/latent variables
  • The E step
  • The M step
  • EM for HMMs
  • Tweaks of the EM algorithm
    • Data augmentation
    • ECM (conditional maximization steps, Gibbs-like)
    • PX-EM (parameter expansion to speed up convergence)
    • MCEM (Monte Carlo EM; approximate, use with care)

Summary: EM is cool!

  • A great approximate/quick method*

*with different caveats

The Bayesian paradigm

$$P(A \given B) = \frac{P(A, B)}{P(B)}$$

  • "Everything is a random variable."
  • Quantities we condition on are considered fixed; everything else is a random variable.

Classic coinflip example

$$\theta \sim \rr{Beta}(\alpha, \beta), \rr{ } Y_i \given \theta \sim \rr{Bernoulli}(\theta)$$

$$ \rr{versus } Y_i \sim \rr{Bernoulli}(\theta), \rr{ } \theta \rr{ unknown} $$

Prior, likelihood, posterior

  • Prior \\( \theta \sim \pi(\theta) \\).
  • Likelihood \\( y \given \theta \sim p(y \given \theta) \\).
  • Posterior \\( p( \theta \given y) \propto p (\theta, y) = \pi(\theta) p(y \given \theta) \\).

The coinflips

$$ \pi(\theta) = \frac{1}{\rr{B}(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} $$

$$p(y_i \given \theta) = \theta^{y_i} (1 - \theta)^{1 - y_i}$$

$$p(\theta \given y_i) = \rr{Beta}(\alpha + y_i, \beta + 1 - y_i)$$

  • Conjugate models.
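A minimal numerical sketch of the conjugate update (the hyperparameters and data below are made up): after \\( n \\) flips the \\( \rr{Beta}(\alpha, \beta) \\) prior updates to \\( \rr{Beta}(\alpha + \sum_i y_i,\ \beta + n - \sum_i y_i) \\).

```python
import numpy as np

# Conjugate Beta-Bernoulli update: a Beta(alpha, beta) prior combined with
# n Bernoulli(theta) observations yields Beta(alpha + sum(y), beta + n - sum(y)).
alpha, beta = 2.0, 2.0             # prior hyperparameters (assumed values)
y = np.array([1, 0, 1, 1, 0, 1])   # observed flips (made-up data)

alpha_post = alpha + y.sum()
beta_post = beta + len(y) - y.sum()

# Posterior mean of theta: a compromise between the prior mean and the sample mean.
print(alpha_post / (alpha_post + beta_post))
```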

Normal-normal model

$$\begin{align}X_i \given \mu & \mathop{\sim}^{\rr{iid}} \rr{Normal}\left( \mu, \sigma^2 \right), \rr{ } i = 1..n; \\ \mu & \sim \rr{Normal}\left( \mu_0, \sigma_0^2\right)\end{align}$$

  • Is this a conjugate model?

Posterior

$$\begin{align} p(\mu \given X) & \propto \pi(\mu)\prod_{i=1}^n \tfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_i-\mu)^2/(2\sigma^2)} \\ & = \pi(\mu)(2\pi\sigma^2)^{-n/2}\, e^{ -\sum_{i=1}^n(x_i-\mu)^2/(2\sigma^2)} \\ & = \pi(\mu) (2\pi\sigma^2)^{-n/2}\, e^{ -\sum_{i=1}^n{( (x_i-\bar{x}) - (\mu-\bar{x}) )^2 \over 2\sigma^2}} \\ & \propto e^{ {-1\over2\sigma_0^2} (\mu-\mu_0)^2} e^{ {-n\over2\sigma^2}(\mu-\bar{x})^2 } \end{align}$$
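Completing the square in the last line (a standard step added here for completeness) shows the posterior is again Normal, so yes, the model is conjugate:

$$\mu \given X \sim \rr{Normal}\left( \frac{\mu_0/\sigma_0^2 + n\bar{x}/\sigma^2}{1/\sigma_0^2 + n/\sigma^2},\ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1} \right)$$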

Bayesian point estimation

  • The MLE changes meaning: the posterior mode is now the MAP estimate.
  • MAP plus curvature at the mode is still OK, provided the posterior looks approximately Normal.

Normal-looking posterior

Problem set example

  • Complete-data log-likelihood:

$$\begin{align} & l(B \given \{\vec{Z}_i, C_{ijk}\} ) = \rr{const} + \\ & \sum_{i=1}^I \left( \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right] \right) \cdot \vec{Z}_i \end{align}$$

Notation

Let

$$\rr{log}\vec{a}_i = \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right]$$

With a prior on \\( Z_i \\)

Let

$$Z_i \sim \rr{Multinom}_N \left(\vec{p}_i\right)$$

$$\begin{align} \rr{log}p(B, Z \given C_{ijk} ) = & \rr{const} + \\ & \sum_{i=1}^I \left( \rr{log} \vec{p}_i + \rr{log}\vec{a}_i\right) \cdot \vec{Z}_i \end{align}$$

Example posterior

$$\vec{Z}_i \given B, \rr{ data} \sim \rr{Multinom}_N \left( \vec{p}_i \odot \vec{a}_i\right)^*$$

*with probabilities proportional to the element-wise product \\( \vec{p}_i \odot \vec{a}_i \\)
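A minimal sketch of this \\( \vec{Z}_i \\) draw (the vectors `p_i` and `a_i` below are hypothetical): normalize the element-wise product and take one multinomial draw of size 1.

```python
import numpy as np

rng = np.random.default_rng(0)

p_i = np.array([0.2, 0.3, 0.5])   # prior membership probabilities (hypothetical)
a_i = np.array([0.1, 0.6, 0.3])   # likelihood factors a_i (hypothetical)

# Posterior probabilities are proportional to the element-wise product p_i * a_i.
probs = p_i * a_i
probs /= probs.sum()

# A single multinomial draw of size 1 is the indicator vector Z_i.
Z_i = rng.multinomial(1, probs)
print(probs, Z_i)
```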

  • Posterior of \\( B \\)?

Strength of the prior

  • Cool idea -- measure the strength of the prior in terms of an equivalent number of data points.
  • Depends on the model.
  • Current work of Xiao-Li Meng (JSM 2012, session "Distributional Inference for Nonparametric Problems").
  • Condition on, or integrate over the observed data?

Strength of prior in example

  • Strength of prior in homework.
    • Let's assume we have 1000 data points and all \\( \vec{a}_i \propto 1/N \cdot \vec{1} \\).
    • For simplicity, assume \\( N = 2 \\).
    • But let's put $$\vec{p}_i = \left( \begin{array}{c} 0.01 \\ 0.99\end{array}\right)$$
    • Posterior distribution of \\( Z_i \given B, \rr{ data} \\)? (worked out below)
  • Reason for having a diffuse prior: a concentrated prior like this one is dogmatic and can dominate the data.
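A quick worked check (an added step, not on the original slide): the posterior probabilities are proportional to the element-wise product \\( \vec{p}_i \odot \vec{a}_i \propto (0.01 \cdot \tfrac{1}{2},\ 0.99 \cdot \tfrac{1}{2})' \propto (0.01,\ 0.99)' \\), so

$$\vec{Z}_i \given B, \rr{ data} \sim \rr{Multinom}_2 \left( \left( \begin{array}{c} 0.01 \\ 0.99 \end{array}\right) \right),$$

i.e. the posterior of \\( Z_i \\) equals the prior no matter how many such data points we observe. That is what makes this prior dogmatic.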

Inference with MCMC

Metropolis algorithm

BDA pp. 289-292:

  1. Draw starting point \\( \theta^0 \\).
  2. Sample proposal \\( \theta^* \\) from jumping distribution \\( J_t (\theta^* \given \theta^{t-1}) \\). The jumping distribution must be symmetric, i.e. \\( J_t (\theta_a \given \theta_b) = J_t (\theta_b \given \theta_a) \\).
  3. Calculate the ratio of the densities \\( r = \frac{p(\theta^*\given y)}{p(\theta^{t-1} \given y)}\\).
  4. Set \\( \theta^t = \begin{cases} \theta^* & \rr{with probability min} (r, 1) \\ \theta^{t-1} & \rr{otherwise}\end{cases} \\)
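A minimal Python sketch of these four steps, assuming a user-supplied `log_post` function proportional to \\( \rr{log}\, p(\theta \given y) \\) (working on the log scale avoids underflow); the function name, step size, and defaults are illustrative, not from the slides.

```python
import numpy as np

def metropolis(log_post, theta0, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian jumping distribution."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        # 2. Sample a proposal theta* from the symmetric jumping distribution.
        proposal = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(proposal)
        # 3.-4. Accept with probability min(1, r), r = p(theta* | y) / p(theta^{t-1} | y).
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        draws[t] = theta
    return draws
```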

Metropolis in 1D

Metropolis in 2D

Proof of Metropolis

Reference: BDA pp. 290-291

  1. The constructed chain is aperiodic, irreducible, and not transient: there exists a unique stationary distribution.
  2. The stationary distribution is the target distribution.

Detailed balance (for the full Metropolis transition kernel \\( T_t \\), i.e. proposal plus accept/reject):

$$p(\theta_a \given y)\, T_t(\theta_b \given \theta_a) = p(\theta_b \given y)\, T_t(\theta_a \given \theta_b)$$
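To see why this holds (following the BDA argument), suppose without loss of generality that \\( p(\theta_b \given y) \geq p(\theta_a \given y) \\) and take \\( \theta_a \neq \theta_b \\). Then

$$\begin{align} p(\theta_a \given y)\, T_t(\theta_b \given \theta_a) & = p(\theta_a \given y)\, J_t(\theta_b \given \theta_a) \cdot 1 \\ p(\theta_b \given y)\, T_t(\theta_a \given \theta_b) & = p(\theta_b \given y)\, J_t(\theta_a \given \theta_b) \cdot \frac{p(\theta_a \given y)}{p(\theta_b \given y)} = p(\theta_a \given y)\, J_t(\theta_a \given \theta_b), \end{align}$$

and the two right-hand sides agree because \\( J_t \\) is symmetric.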

Example of Metropolis

A basic setup:

  • Target density \\( p(\theta \given y) = \rr{N}_2(\theta \given 0, I) \\)
  • Proposal density \\( \rr{N}(\theta^* \given \theta^{(t-1)}, 0.2^2I) \\)
  • Jumping rule is clearly symmetric.
  • Then at each step the acceptance probability is $$\rr{min}\left( 1, {\rr{N}(\theta^* \given 0, I) \over \rr{N}(\theta^{(t-1)} \given 0, I)}\right)$$
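A quick sketch of this exact setup, reusing the `metropolis` helper from the earlier sketch (the starting point and iteration count are illustrative):

```python
import numpy as np

# Log-density of the N_2(0, I) target, up to an additive constant.
def log_post(theta):
    return -0.5 * np.sum(theta ** 2)

draws = metropolis(log_post, theta0=np.array([2.5, 2.5]), n_iter=10000, step=0.2)
print(draws[1000:].mean(axis=0), draws[1000:].std(axis=0))  # roughly (0, 0) and (1, 1)
```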

Gibbs sampler

A simple yet powerful concept: divide and conquer.

  • BDA pp. 287-289.
  • "Each iteration of the Gibbs sampler cycles through the subvectors of \\( \theta \\), drawing each subset conditional on the calue of all the others."

Example: bivariate Normal

Posterior distribution: $$\left.\left( \begin{array}{c} \theta_1 \\ \theta_2\end{array}\right)\right| \rr{ }y \sim \rr{N} \left( \left( \begin{array}{c} y_1 \\ y_2\end{array}\right), \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right) \right)$$

Gibbs updates: $$\begin{align} \theta_1 \given \theta_2, y & \sim \rr{N} (y_1 + \rho (\theta_2 - y_2), 1 - \rho^2) \\ \theta_2 \given \theta_1, y & \sim \rr{N} (y_2 + \rho (\theta_1 - y_1), 1 - \rho^2)\end{align}$$
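A minimal sketch of these two updates, assuming observed `y` and a known \\( \rho \\) (the values below are illustrative):

```python
import numpy as np

def gibbs_bivariate_normal(y, rho, n_iter=5000, seed=0):
    """Gibbs sampler for theta | y in the bivariate Normal example."""
    rng = np.random.default_rng(seed)
    theta1, theta2 = 0.0, 0.0
    sd = np.sqrt(1 - rho ** 2)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # theta_1 | theta_2, y ~ N(y_1 + rho (theta_2 - y_2), 1 - rho^2)
        theta1 = rng.normal(y[0] + rho * (theta2 - y[1]), sd)
        # theta_2 | theta_1, y ~ N(y_2 + rho (theta_1 - y_1), 1 - rho^2)
        theta2 = rng.normal(y[1] + rho * (theta1 - y[0]), sd)
        draws[t] = theta1, theta2
    return draws

draws = gibbs_bivariate_normal(y=np.array([1.0, -1.0]), rho=0.8)
```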

Proof of Gibbs 1

Define \\(x \sim_j y \\) if \\( \left.x_i = y_i\right.\\) for all \\(i \neq j\\) and let \\( \left.p_{xy}\right.\\) denote the probability of a jump from \\( x \in \Theta \\) to \\( y \in \Theta\\). Then, the transition probabilities are

$$p_{xy} = \begin{cases} \frac{1}{d}\frac{g(y)}{\sum_{z \in \Theta: z \sim_j x} g(z) } & x \sim_j y \\ 0 & \text{otherwise} \end{cases}$$

Reference: Wikipedia.

Proof of Gibbs 2

So $$\begin{align}g(x) p_{xy} & = \frac{1}{d}\frac{ g(x) g(y)}{\sum_{z \in \Theta: z \sim_j x} g(z) } \\ & = \frac{1}{d}\frac{ g(y) g(x)}{\sum_{z \in \Theta: z \sim_j y} g(z) } = g(y) p_{yx}\end{align}$$

since \\(x \sim_j y \\) is an equivalence relation.

So detailed balance is satisfied, and the Gibbs sampler works.

Metropolis-within-Gibbs

  • Can replace the exact draw of a subvector of \\( \theta \\) with a Metropolis iteration (or a few Metropolis iterations).

Ex.: Normal, \\( \rho \\) unknown

$$\left.\left( \begin{array}{c} \theta_1 \\ \theta_2\end{array}\right)\right| \rr{ }y, \rho \sim \rr{N} \left( \left( \begin{array}{c} y_1 \\ y_2\end{array}\right), \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right) \right)$$

Gibbs updates: $$\begin{align} \theta_1 \given \theta_2, y & \sim \rr{N} (y_1 + \rho (\theta_2 - y_2), 1 - \rho^2) \\ \theta_2 \given \theta_1, y & \sim \rr{N} (y_2 + \rho (\theta_1 - y_1), 1 - \rho^2)\end{align}$$

$$\rho \given \theta_1, \theta_2, y \sim \rr{ draw via Metropolis}$$

Difficulty

  • The support of \\( \rho \\) is \\( [0, 1] \\), so it is hard to design a symmetric jumping distribution for it directly.
  • Answer: transform the space!

Metropolis on transformed space

  • Define a function \\( f: \mathbb{D}^k \rightarrow \mathbb{R}^k \\), where \\( \mathbb{D}^k \\) is the domain of the \\(k\\)-dimensional parameter \\( \theta \\).
  • In our case, \\( k = 1 \\), the parameter is \\( \rho \\), \\( \mathbb{D}^k = [0, 1] \\). Let \\( f = \rr{logit} \\).
  • Let \\( \psi = f(\theta) \\), and \\( |J(\psi)| \\) be the corresponding determinant of the change of variables. Use a symmetric proposal on \\( \psi \\).

Acceptance ratio

$$\begin{align} r & = \frac{p(f^{-1}(\psi^{\rr{proposed}})) \cdot |J(\psi^{\rr{proposed}})|}{p(f^{-1}(\psi^{\rr{initial}})) \cdot |J(\psi^{\rr{initial}})|} \\ &= \frac{p(\theta^{\rr{proposed}})}{p(\theta^{\rr{initial}})} \cdot \frac{|J(\psi^{\rr{proposed}})|}{|J(\psi^{\rr{initial}})|} \end{align}$$

Accept with probability \\( \rr{min} (1, r) \\) as usual.

Example

logit transformation

  • \\( f^{-1} = \rr{invlogit} \\)
  • \\( J(\psi) = \rr{invlogit}'(\psi) \\)

$$\begin{align}\rr{invlogit}'(\psi) & = \left( {\rr{exp}(\psi) \over 1 + \rr{exp}(\psi)}\right)' \\ & = {\rr{exp}(\psi) \over \left(1 + \rr{exp}(\psi) \right)^2}\end{align}$$
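A sketch of one such Metropolis update for \\( \rho \\) on the logit scale, assuming a hypothetical function `log_post_rho(rho, theta, y)` proportional to \\( \rr{log}\, p(\rho \given \theta_1, \theta_2, y) \\) is available; the Jacobian factor is exactly the \\( \rr{invlogit} \\) derivative above.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def invlogit(psi):
    return 1 / (1 + np.exp(-psi))

def update_rho(rho, theta, y, log_post_rho, step=0.5, rng=np.random.default_rng()):
    """One Metropolis step for rho in (0, 1) via a symmetric proposal on psi = logit(rho)."""
    psi = logit(rho)
    psi_prop = psi + step * rng.standard_normal()   # symmetric proposal on the logit scale
    rho_prop = invlogit(psi_prop)
    # log of the Jacobian |J(psi)| = invlogit'(psi) = exp(psi) / (1 + exp(psi))^2
    log_jac = lambda z: z - 2 * np.logaddexp(0.0, z)
    log_r = (log_post_rho(rho_prop, theta, y) + log_jac(psi_prop)
             - log_post_rho(rho, theta, y) - log_jac(psi))
    return rho_prop if np.log(rng.uniform()) < log_r else rho
```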

HMC - introduction

Read on HMC

Announcements

  • T-shirt competition winner!
  • No new homework till Wednesday after spring break.
  • Happy spring break!

Resources

Final slide

  • Next lecture: Missing data + MCMC, more on HMC