# Likelihood + Prior = Posterior (Bayesian inference)

### Stat 221, Lecture 13

@snesterko

• Recap what we've learned about EM.
• Strength of prior
• The Metropolis Algorithm
• Gibbs sampling
• Peek at HMC

## Recap EM

• A half-Bayesian algorithm
• MLE with missing data/latent variables
• The E step
• The M step
• EM for HMMs
• Tweaks of the EM algorithm
• Data augmentation
• ECM (cond. max. - like Gibbs)
• PX-EM (speed up convergence)
• MCEM (a potentially dangerous approximate method)

## Summary: EM is cool!

• A great approximate/quick method*

*with different caveats

$$P(A \given B) = \frac{P(A, B)}{P(B)}$$

• "Everything is a random variable."
• There are quantities we condition on that are considered fixed, all others are random variables.

## Classic coinflip example

$$\theta \sim \rr{Beta}(\alpha, \beta), \rr{ } Y_i \given \theta \sim \rr{Bernoulli}(\theta)$$

$$\rr{versus } Y_i \sim \rr{Bernoulli}(\theta), \rr{ } \theta \rr{ unknown}$$

## Prior, likelihood, posterior

• Prior \$$\theta \sim \pi(\theta) \$$.
• Likelihood \$$y \given \theta \sim p(y \given \theta) \$$.
• Posterior \$$p( \theta \given y) \propto p (\theta, y) = \pi(\theta) p(y \given \theta) \$$.

## The coinflips

$$\pi(\theta) = \frac{1}{\rr{B}(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

$$p(y_i \given \theta) = \theta^{y_i} (1 - \theta)^{1-y_i}$$

$$p(\theta \given y_i) = \rr{Beta}(\alpha + y_i, \beta + 1 - y_i)$$

• Conjugate models.
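As a quick sketch of conjugacy (the Beta(2, 2) prior and the flips are made-up illustration values), the update is just counting successes and failures:

```python
def beta_bernoulli_update(alpha, beta, flips):
    """Conjugate update: each flip y adds y to alpha and (1 - y) to beta."""
    for y in flips:
        alpha += y
        beta += 1 - y
    return alpha, beta

# Hypothetical Beta(2, 2) prior and 7 heads out of 10 flips.
alpha_post, beta_post = beta_bernoulli_update(2, 2, [1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
posterior_mean = alpha_post / (alpha_post + beta_post)  # Beta(9, 5) mean
```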

## Normal-normal model

\begin{align}X_i \given \mu & \mathop{\sim}^{\rr{iid}} \rr{Normal}\left( \mu, \sigma^2 \right), \rr{ } i = 1, \ldots, n; \\ \mu & \sim \rr{Normal}\left( \mu_0, \sigma_0^2\right)\end{align}

• Is this a conjugate model?

## Posterior

\begin{align} p(\mu \given X) & = \pi(\mu)\prod_{i=1}^n \tfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_i-\mu)^2/(2\sigma^2)} \\ & = \pi(\mu)(2\pi\sigma^2)^{-n/2}\, e^{ -\sum_{i=1}^n(x_i-\mu)^2/(2\sigma^2)} \\ & = \pi(\mu) (2\pi\sigma^2)^{-n/2}\, e^{ -\sum_{i=1}^n{( (x_i-\bar{x}) - (\mu-\bar{x}) )^2 \over 2\sigma^2}} \\ & \propto e^{ {-1\over2\sigma_0^2} (\mu-\mu_0)^2} e^{ {-n\over2\sigma^2}(\mu-\bar{x})^2 } \end{align}
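Completing the square in the last line shows the model is indeed conjugate; the posterior precision is the sum of the prior and data precisions:

$$\mu \given X \sim \rr{Normal} \left( \frac{\mu_0/\sigma_0^2 + n\bar{x}/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}, \rr{ } \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} \right)$$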

## Bayesian point estimation

• The MLE changes meaning: the posterior mode is now the MAP (maximum a posteriori) estimate.
• MAP plus curvature at the mode is still OK, provided the posterior looks approximately Normal.

## Problem set example

• Complete-data log-likelihood:

\begin{align} & l(B \given \{\vec{Z}_i, C_{ijk}\} ) = \rr{const} + \\ & \sum_{i=1}^I \left( \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right] \right) \cdot \vec{Z}_i \end{align}

## Notation

Let

$$\rr{log}\vec{a}_i = \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right]$$

## With a prior on \$$Z_i \$$

Let

$$Z_i \sim \rr{Multinom}_N \left(\vec{p}_i\right)$$

\begin{align} \rr{log}p(B, Z \given C_{ijk} ) = & \rr{const} + \\ & \sum_{i=1}^I \left( \rr{log} \vec{p}_i + \rr{log}\vec{a}_i\right) \cdot \vec{Z}_i \end{align}

## Example posterior

\begin{align}\vec{Z}_i \given B, \rr{ data} \sim \rr{Multinom}_N \left( \vec{p}_i \vec{a}_i\right)\end{align}

*with probabilities proportional to the element-wise product of \$$\vec{p}_i \$$ and \$$\vec{a}_i \$$

• Posterior of \$$B \$$?

## Strength of the prior

• Cool idea -- measure the strength of a prior in terms of the number of data points it is worth.
• Depends on the model.
• Current work of Xiao-Li Meng (JSM 2012, session "Distributional Inference for Nonparametric Problems").
• Condition on, or integrate over the observed data?

## Strength of prior in example

• Strength of prior in homework.
• Let's assume we have 1000 data points and all \$$\vec{a}_i \propto 1/N \cdot \vec{1} \$$.
• For simplicity, assume \$$N= 2\$$.
• But let's put $$\vec{p}_i = \left( \begin{array}{c} 0.01 \\ 0.99\end{array}\right)$$
• Posterior distribution of \$$Z_i \given B, \rr{ data} \$$?
• Reason for having a diffuse prior instead: a concentrated prior like this one is dogmatic.
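A tiny sketch of this slide's numbers (with \$$N = 2 \$$, a flat \$$\vec{a}_i \$$, and the dogmatic \$$\vec{p}_i \$$ above) shows the prior dominating the posterior of \$$Z_i \$$:

```python
def posterior_z(p, a):
    """Posterior class probabilities for Z_i: element-wise product of the
    prior p and the likelihood factor a, renormalized to sum to one."""
    w = [pi * ai for pi, ai in zip(p, a)]
    total = sum(w)
    return [wi / total for wi in w]

# Flat likelihood contribution a_i, dogmatic prior p_i from the slide:
# the data cannot move the posterior away from the prior.
post = posterior_z([0.01, 0.99], [0.5, 0.5])
```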

## Metropolis algorithm

BDA pp. 289-292:

1. Draw starting point \$$\theta^0 \$$.
2. Sample proposal \$$\theta^* \$$ from jumping distribution \$$J_t (\theta^* \given \theta^{t-1}) \$$. The jumping distribution must be symmetric, i.e. \$$J_t (\theta_a \given \theta_b) = J_t (\theta_b \given \theta_a) \$$.
3. Calculate the ratio of the densities \$$r = \frac{p(\theta^*\given y)}{p(\theta^{t-1} \given y)}\$$.
4. Set \$$\theta^t = \begin{cases} \theta^* & \rr{with probability min} (r, 1) \\ \theta^{t-1} & \rr{otherwise}\end{cases} \$$
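The four steps can be sketched as follows (a minimal random-walk version; the Normal jumping distribution and its scale are illustrative choices, and log densities are used to avoid underflow):

```python
import math
import random

def metropolis(log_p, theta0, scale, n_iter):
    """Random-walk Metropolis with symmetric Normal jumps."""
    theta = list(theta0)
    draws = []
    for _ in range(n_iter):
        # Step 2: symmetric proposal, J_t(a | b) = J_t(b | a).
        proposal = [t + random.gauss(0.0, scale) for t in theta]
        # Step 3: log r = log p(theta* | y) - log p(theta^{t-1} | y).
        log_r = log_p(proposal) - log_p(theta)
        # Step 4: accept with probability min(r, 1).
        if math.log(random.random()) < min(log_r, 0.0):
            theta = proposal
        draws.append(list(theta))
    return draws
```

For instance, `metropolis(lambda t: -0.5 * t[0] ** 2, [3.0], 1.0, 20000)` draws from a standard Normal.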

## Proof of Metropolis

Reference: BDA pp. 290-291

1. The constructed chain is aperiodic, irreducible, and not transient: there exists a unique stationary distribution.
2. The stationary distribution is the target distribution.

Detailed balance, for the transition distribution \$$T \$$ of the chain:

$$p(\theta_a \given y) T(\theta_b \given \theta_a) = p(\theta_b \given y) T(\theta_a \given \theta_b)$$
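Detailed balance can be checked numerically on a small discrete example (the three-state target and symmetric proposal matrix below are made up for illustration):

```python
def metropolis_kernel(p, J):
    """Transition matrix of the Metropolis chain for a discrete target p
    and a symmetric proposal matrix J (rejections stay in place)."""
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                T[a][b] = J[a][b] * min(1.0, p[b] / p[a])
        T[a][a] = 1.0 - sum(T[a])
    return T

p = [0.1, 0.3, 0.6]                                       # target distribution
J = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]   # symmetric proposal
T = metropolis_kernel(p, J)
# Detailed balance p(a) T(b|a) == p(b) T(a|b) holds for every pair,
# which implies p is stationary for T.
```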

## Example of Metropolis

A basic setup:

• Target density \$$p(\theta \given y) = \rr{N}_2(\theta \given 0, I) \$$
• Proposal density \$$\rr{N}(\theta^* \given \theta^{(t-1)}, 0.2^2I) \$$
• The jumping rule is clearly symmetric.
• Then at each step the acceptance probability is $$\rr{min}\left( 1, {\rr{N}(\theta^* \given 0, I) \over \rr{N}(\theta^{(t-1)} \given 0, I)}\right)$$
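Running this basic setup (a sketch; the starting point and iteration count are arbitrary) shows that the small \$$0.2 \$$ jumps are accepted most of the time:

```python
import math
import random

random.seed(1)
theta = [2.0, 2.0]          # arbitrary starting point
accepted = 0
n_iter = 5000
for _ in range(n_iter):
    # Proposal N(theta* | theta^{t-1}, 0.2^2 I).
    prop = [t + random.gauss(0.0, 0.2) for t in theta]
    # log of the acceptance ratio N(theta* | 0, I) / N(theta^{t-1} | 0, I).
    log_r = -0.5 * sum(t ** 2 for t in prop) + 0.5 * sum(t ** 2 for t in theta)
    if math.log(random.random()) < min(log_r, 0.0):
        theta = prop
        accepted += 1
accept_rate = accepted / n_iter  # high: the chain takes many tiny steps
```

The flip side of a high acceptance rate is slow exploration: with such small steps the chain needs many iterations to traverse the target.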

## Gibbs sampler

A simple yet powerful concept: divide and conquer.

• BDA pp. 287-289.
• "Each iteration of the Gibbs sampler cycles through the subvectors of \$$\theta \$$, drawing each subset conditional on the calue of all the others."

## Example: bivariate Normal

Posterior distribution: $$\left.\left( \begin{array}{c} \theta_1 \\ \theta_2\end{array}\right)\right| \rr{ }y \sim \rr{N} \left( \left( \begin{array}{c} y_1 \\ y_2\end{array}\right), \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right) \right)$$

Gibbs updates: \begin{align} \theta_1 \given \theta_2, y & \sim \rr{N} (y_1 + \rho (\theta_2 - y_2), 1 - \rho^2) \\ \theta_2 \given \theta_1, y & \sim \rr{N} (y_2 + \rho (\theta_1 - y_1), 1 - \rho^2)\end{align}
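A sketch of these two updates (the values of \$$y \$$ and \$$\rho \$$ below are made up for illustration):

```python
import math
import random

def gibbs_bivariate_normal(y1, y2, rho, n_iter, theta=(0.0, 0.0)):
    """Gibbs sampler for the bivariate Normal posterior above:
    alternately draw each theta_j from its full conditional."""
    t1, t2 = theta
    sd = math.sqrt(1.0 - rho ** 2)  # conditional sd, same for both updates
    draws = []
    for _ in range(n_iter):
        t1 = random.gauss(y1 + rho * (t2 - y2), sd)
        t2 = random.gauss(y2 + rho * (t1 - y1), sd)
        draws.append((t1, t2))
    return draws

random.seed(3)
draws = gibbs_bivariate_normal(1.0, -1.0, 0.8, 20000)
```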

## Proof of Gibbs 1

Define \$$x \sim_j y \$$ if \$$x_i = y_i \$$ for all \$$i \neq j \$$, and let \$$p_{xy} \$$ denote the probability of a jump from \$$x \in \Theta \$$ to \$$y \in \Theta \$$. Then the transition probabilities are

$$p_{xy} = \begin{cases} \frac{1}{d}\frac{g(y)}{\sum_{z \in \Theta: z \sim_j x} g(z) } & x \sim_j y \\ 0 & \text{otherwise} \end{cases}$$

Reference: Wikipedia.

## Proof of Gibbs 2

So \begin{align}g(x) p_{xy} & = \frac{1}{d}\frac{ g(x) g(y)}{\sum_{z \in \Theta: z \sim_j x} g(z) } \\ & = \frac{1}{d}\frac{ g(y) g(x)}{\sum_{z \in \Theta: z \sim_j y} g(z) } = g(y) p_{yx}\end{align}

since \$$x \sim_j y \$$ is an equivalence relation.

So detailed balance is satisfied, and the Gibbs sampler leaves the target distribution stationary.
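The argument can be checked numerically on a small discrete grid (the unnormalized \$$g \$$ below is arbitrary; the \$$1/d \$$ factor from picking the coordinate is omitted since it cancels on both sides):

```python
from itertools import product

# Arbitrary unnormalized target g on the grid {0, 1, 2} x {0, 1, 2}.
g = {s: (s[0] + 1) * (s[1] + 2) ** 2 for s in product(range(3), repeat=2)}

def p_jump(x, y, j):
    """Probability that the coordinate-j Gibbs update moves x to y;
    nonzero only when x ~_j y (all coordinates other than j agree)."""
    if any(x[i] != y[i] for i in range(2) if i != j):
        return 0.0
    # Normalizer: total weight of the equivalence class of x under ~_j.
    norm = sum(g[z] for z in g if all(z[i] == x[i] for i in range(2) if i != j))
    return g[y] / norm

# Detailed balance g(x) p_xy == g(y) p_yx holds for every pair and coordinate.
```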

## Metropolis-within-Gibbs

• Can replace the exact draw of a subvector of \$$\theta \$$ with a Metropolis iteration (or a few Metropolis iterations).

## Ex.: Normal, \$$\rho \$$ unknown

$$\left.\left( \begin{array}{c} \theta_1 \\ \theta_2\end{array}\right)\right| \rr{ }y, \rho \sim \rr{N} \left( \left( \begin{array}{c} y_1 \\ y_2\end{array}\right), \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right) \right)$$

Gibbs updates: \begin{align} \theta_1 \given \theta_2, y & \sim \rr{N} (y_1 + \rho (\theta_2 - y_2), 1 - \rho^2) \\ \theta_2 \given \theta_1, y & \sim \rr{N} (y_2 + \rho (\theta_1 - y_1), 1 - \rho^2)\end{align}

$$\rho \given \theta_1, \theta_2, y \sim \rr{ draw via Metropolis}$$

## Difficulty

• The support of \$$\rho \$$ is \$$[0, 1] \$$, so it is hard to design a symmetric jumping distribution that stays inside it.

## Metropolis on transformed space

• Define a function \$$f: \mathbb{D}^k \rightarrow \mathbb{R}^k \$$, where \$$\mathbb{D}^k \$$ is the domain of the \$$k\$$-dimensional parameter \$$\theta \$$.
• In our case, \$$k = 1 \$$, the parameter is \$$\rho \$$, \$$\mathbb{D}^k = [0, 1] \$$. Let \$$f = \rr{logit} \$$.
• Let \$$\psi = f(\theta) \$$, and \$$|J(\psi)| \$$ be the corresponding determinant of the change of variables. Use a symmetric proposal on \$$\psi \$$.

## Acceptance ratio

\begin{align} r & = \frac{p(f^{-1}(\psi^{\rr{proposed}})) \cdot |J(\psi^{\rr{proposed}})|}{p(f^{-1}(\psi^{\rr{initial}})) \cdot |J(\psi^{\rr{initial}})|} \\ &= \frac{p(\theta^{\rr{proposed}})}{p(\theta^{\rr{initial}})} \cdot \frac{|J(\psi^{\rr{proposed}})|}{|J(\psi^{\rr{initial}})|} \end{align}

Accept with probability \$$\rr{min} (1, r) \$$ as usual.

## Example

logit transformation

• \$$f^{-1} = \rr{invlogit} \$$
• \$$J(\psi) = \rr{invlogit}'(\psi) \$$

\begin{align}\rr{invlogit}'(\psi) & = \left( {\rr{exp}(\psi) \over 1 + \rr{exp}(\psi)}\right)' \\ & = {\rr{exp}(\psi) \over \left(1 + \rr{exp}(\psi) \right)^2}\end{align}
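Putting the pieces together (a sketch; the Beta(2, 5)-shaped density below is a hypothetical stand-in for \$$p(\rho \given \theta_1, \theta_2, y) \$$, and the proposal scale is arbitrary):

```python
import math
import random

def logit(u):
    return math.log(u / (1.0 - u))

def invlogit(psi):
    return 1.0 / (1.0 + math.exp(-psi))

def log_jacobian(psi):
    # log invlogit'(psi) = psi - 2 log(1 + e^psi), from the derivative above.
    return psi - 2.0 * math.log1p(math.exp(psi))

def log_target(rho):
    # Hypothetical stand-in for log p(rho | theta_1, theta_2, y) on [0, 1]:
    # an unnormalized Beta(2, 5) kernel.
    return math.log(rho) + 4.0 * math.log(1.0 - rho)

random.seed(4)
psi = logit(0.5)
rhos = []
for _ in range(20000):
    psi_prop = psi + random.gauss(0.0, 1.0)  # symmetric proposal on psi
    # r includes the Jacobian correction from the change of variables.
    log_r = (log_target(invlogit(psi_prop)) + log_jacobian(psi_prop)
             - log_target(invlogit(psi)) - log_jacobian(psi))
    if math.log(random.random()) < min(log_r, 0.0):
        psi = psi_prop
    rhos.append(invlogit(psi))               # every draw stays in (0, 1)
```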

### HMC - introduction

• A great overview by Radford Neal:

## Announcements

• T-shirt competition winner!
• No new homework till Wednesday after spring break.
• Happy spring break!

## Final slide

• Next lecture: Missing data + MCMC, more on HMC