that can be executed by a number of independent workers (possibly, in communication with each other), can be reframed as a parallel computing exercise.

- Likelihood calculation.
- MCMC chains.
- Optimization.
- Data processing.

Parallelization of one of the following types:

- In the document authorship example, determining features and calculating feature counts
- In the optimization of marketing campaigns example, preparing data for the fitting procedure and optimization, calculating sufficient statistics
- And so on

Document authorship example

- Document authorship \\( \pi_d \sim \rr{Dirichlet}_K(\vec{\alpha},1) \\).
- Data \\( y_{dw} \given \pi_d \sim \rr{Poisson}(\sum_k\mu_{kw}\pi_{dk})\\).
- Author \\( k\\), document \\( d \\), word \\( w \\).
- \\( \pi_{dk} \\) represents the amount of contribution of author \\( k \\) do document \\(d\\).

For document \\( d \\):

$$\begin{align}L(\alpha, \mu, \pi \given y) \propto & \prod_{w=1}^W\left[\left(\sum_k\mu_{kw}\pi_{dk}\right)^{y_{dw}} \right. \\ & \left.\cdot e^{-\left(\sum_k\mu_{kw}\pi_{dk}\right)}\right] \cdot \prod_{k=1}^K \pi_{dk}^{\alpha_k - 1}\end{align}$$

We really want to compute the log-likelihood, however.

For document \\( d \\):

$$\begin{align}l(\alpha, \mu, \pi & \given y) \propto \sum_{w=1}^W\left[y_{dw}\rr{log}\left(\sum_k\mu_{kw}\pi_{dk}\right)\right. \\ & \left. -\sum_k\mu_{kw}\pi_{dk}\right] + \sum_{k=1}^K (\alpha_k - 1)\rr{log}\pi_{dk}\end{align}$$

This is just for a single document \\( d \\), so can parallelize if we have many documents. How to do that?

Optimize decision \\( \delta(X) \\) based on inferred parameters

$$ \int C(\theta, \delta(x)) p(\theta \given y, x) d \theta$$

- Good when doing Gaussian mixture approximation.
- Full sampling approach can still be parallelized:
- Parallel tempering
- Approximate methods such as variational inference (careful, approximation!)

- What is the price for parallelization?
- Specifically, communication speed.
- Implementation difficulty.
- Is it worth it?

- Multithreading
- MPI
- GPU
- Grid computing

- Processes that can send each other messages.
- There are many functions that facilitate this (mpi_send, mpi_recv etc).
- Different languages have wrappers around MPI for ease of use.
- For example, Rmpi for R.

- Good at many simple arithmetic operations.

- Speed:
- Multithreading \\(\sim\\) GPU > MPI* > Grid

- Implementation in R (">" = faster)
- Multithreading > MPI > \\(\rr{GPU}=\rr{Grid}= \infty \\)

Multithreading is easy with R package multicore.

*MPI can be fast when using CPU cores are workers.

- Multithreading is fast but limited to the number of threads per processor.
- MPI is limited by the internode communication overhead, but clusters can (and are) constructed with low inter-node communication cost.
- GPUs are hard to program (CUDA), but are very fast at repeated simple calculations.
- Grids have very slow communication speeds, but can have massive numbers of computers.

- Father Big Table
- Hive
- Pig
- HBASE
- Many others
- Does it matter?

- Infrastructure costs are low
- Amazon AWS (rocks)
- Multitude of vendors

- Rapidly becoming common
- Accessible implementations (R, Python, C++11)
- Clusters are becoming faster
- It is intuitive and less restrictive

- multicore (very useful for emb. parallelization)
- snow, snowfall
- Rmpi

- Run several independent jobs in parallel, collect the results
- Parallel sapply (mpi.applyLB)
- What we've been doing on Odyssey

```
#mpi.applyLB
x=1:7
mpi.applyLB(x, rnorm, mean = 2, sd = 4)
#get the same simulation
mpi.remote.exec(set.seed(111))
mpi.applyLB(x, rnorm, mean = 2, sd = 4)
#mpi.parApply
x=1:24
dim(x)=c(2, 3, 4)
mpi.parApply(x, MARGIN=c(1, 2), fun = mean, job.num = 5)
```

- This is when the worker nodes communicate with each other in the process of executing the jobs.
- Such scheme is generally more robust as can be designed not to depend heavily on master node.
- It's also lower-level and trickier to implement and debug.

- Why is it so? Student discussion.

```
mpi.isend(x, type, dest, tag, comm = 1, request=0)
mpi.irecv(x, type, source, tag, comm = 1, request = 0)
```

These are non-blocking calls.

Basic examples:

```
#on a slave
mpi.send(1:10, 1, 0, 0)
#on master
x <- integer(10)
mpi.irecv(x, 1, 1, 0)
x
mpi.wait()
x
```

- A generic question "how can I speed things up?"
- Starts with "going back to the blackboard"
- Not just MPI

- Document authorship example

$$\begin{align}l(\alpha, \mu, \pi & \given y) \propto \sum_{w=1}^W\left[y_{dw}\rr{log}\left(\sum_k\mu_{kw}\pi_{dk}\right)\right. \\ & \left. -\sum_k\mu_{kw}\pi_{dk}\right] + \sum_{k=1}^K (\alpha_k - 1)\rr{log}\pi_{dk}\end{align}$$

- Parallel Tempering
- Equi-energy sampler

- Faster code/languages

- T-shirt competition winner!
- April 4 second final project submission deadline.

- Slides nesterko.com/lectures/stat221-2012/lecture17
- Class website theory.info/harvardstat221
- Class Piazza piazza.com/class#spring2013/stat221
- Class Twitter twitter.com/harvardstat221

- Next lecture: Parallel Tempering - nuts and bolts.