 # Parallel statistical computing

@snesterko

### Where we are

## What is parallel computing

Any algorithm that can be executed by a number of independent workers (possibly in communication with each other) can be reframed as a parallel computing exercise. Parallelization is of one of the following types:

• Likelihood calculation.
• MCMC chains.
• Optimization.
• Data processing.

## Example: data processing

• In the document authorship example, determining features and calculating feature counts
• In the optimization of marketing campaigns example, preparing data for the fitting procedure and optimization, calculating sufficient statistics
• And so on

## Example: likelihood

Document authorship example

• Document authorship \$$\pi_d \sim \rr{Dirichlet}_K(\vec{\alpha},1) \$$.
• Data \$$y_{dw} \given \pi_d \sim \rr{Poisson}(\sum_k\mu_{kw}\pi_{dk})\$$.
• Author \$$k\$$, document \$$d \$$, word \$$w \$$.
• \$$\pi_{dk} \$$ represents the contribution of author \$$k \$$ to document \$$d\$$.

## Likelihood

For document \$$d \$$:

\begin{align}L(\alpha, \mu, \pi \given y) \propto & \prod_{w=1}^W\left[\left(\sum_k\mu_{kw}\pi_{dk}\right)^{y_{dw}} \right. \\ & \left.\cdot e^{-\left(\sum_k\mu_{kw}\pi_{dk}\right)}\right] \cdot \prod_{k=1}^K \pi_{dk}^{\alpha_k - 1}\end{align}

We really want to compute the log-likelihood, however.

## Log likelihood

For document \$$d \$$:

\begin{align}l(\alpha, \mu, \pi & \given y) \propto \sum_{w=1}^W\left[y_{dw}\rr{log}\left(\sum_k\mu_{kw}\pi_{dk}\right)\right. \\ & \left. -\sum_k\mu_{kw}\pi_{dk}\right] + \sum_{k=1}^K (\alpha_k - 1)\rr{log}\pi_{dk}\end{align}

This is just for a single document \$$d \$$, so we can parallelize over documents when there are many of them. How do we do that?
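A minimal sketch of one way to do it with forked R processes (the names y, mu, pi, alpha, and doc_loglik are hypothetical; y is assumed to be a \$$D \times W \$$ count matrix, mu a \$$K \times W \$$ rate matrix, pi a \$$D \times K \$$ matrix, and alpha a length-\$$K \$$ vector):

```r
library(parallel)  # mclapply; the multicore package provides the same interface

# Log-likelihood contribution of a single document d (up to a constant).
doc_loglik <- function(d, y, mu, pi, alpha) {
  rates <- as.vector(pi[d, ] %*% mu)        # sum_k mu_kw * pi_dk for each word w
  sum(y[d, ] * log(rates) - rates) +        # Poisson terms
    sum((alpha - 1) * log(pi[d, ]))         # Dirichlet prior term
}

# Embarrassingly parallel over documents: one task per document, 4 cores.
loglik_per_doc <- mclapply(seq_len(nrow(y)), doc_loglik,
                           y = y, mu = mu, pi = pi, alpha = alpha,
                           mc.cores = 4)
total_loglik <- sum(unlist(loglik_per_doc))
```

Each document's term is computed independently, so the only synchronization point is the final sum.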

## Example: optimization

Optimize the decision \$$\delta(x) \$$ based on the inferred parameters, i.e. minimize the expected posterior cost

$$\int C(\theta, \delta(x)) p(\theta \given y, x) d \theta$$

• Works well when a Gaussian mixture approximation of the posterior is used.
• A full sampling approach can still be parallelized:
  • Parallel tempering
• Approximate methods such as variational inference (careful: it is only an approximation!)
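For instance, the integral can be approximated by a Monte Carlo average over posterior draws, which again splits across workers. A minimal sketch with a placeholder scalar parameter, a made-up quadratic cost C, and stand-in draws theta_draws (all hypothetical names):

```r
library(parallel)

C <- function(theta, delta) (theta - delta)^2    # placeholder cost function
theta_draws <- rnorm(1e5, mean = 1, sd = 2)      # stand-in for posterior samples of theta

# Estimate E[C(theta, delta) | y, x] by averaging over draws, split into 4 chunks.
expected_cost <- function(delta) {
  chunks <- split(theta_draws, rep(1:4, length.out = length(theta_draws)))
  sums   <- mclapply(chunks, function(th) sum(C(th, delta)), mc.cores = 4)
  sum(unlist(sums)) / length(theta_draws)
}

# Choose the decision with the smallest estimated expected cost.
optimize(expected_cost, interval = c(-5, 5))
```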

## Questions that arise

• What is the price for parallelization?
• Specifically, communication speed.
• Implementation difficulty.
• Is it worth it?

## Types of parallel computing

• MPI
• GPU
• Grid computing

## MPI

## MPI characterized

• Processes that can send each other messages.
• There are many functions that facilitate this (MPI_Send, MPI_Recv, etc.).
• Different languages have wrappers around MPI for ease of use.
• For example, Rmpi for R.
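A minimal Rmpi sketch of this style of use (assumes Rmpi and a working MPI installation; the worker count 4 is arbitrary):

```r
library(Rmpi)

mpi.spawn.Rslaves(nslaves = 4)   # start 4 R worker processes via MPI
# Each worker reports its rank; mpi.remote.exec runs the expression on every worker.
mpi.remote.exec(paste("I am worker", mpi.comm.rank(), "of", mpi.comm.size()))
mpi.close.Rslaves()              # shut the workers down when done
```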

## GPU

• Good at many simple arithmetic operations.

## Grid computing

## Comparisons

• Speed (">" = faster):
  • Multithreading \$$\sim\$$ GPU > MPI* > Grid
• Implementation in R (">" = faster to implement):
  • Multithreading > MPI > \$$\rr{GPU}=\rr{Grid}= \infty \$$

Multithreading is easy with R package multicore.

*MPI can be fast when using CPU cores as workers.
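For example, a hypothetical toy use of mclapply (from multicore, also available in the base parallel package) that spreads independent simulations over cores:

```r
library(parallel)

# Eight independent simulations, forked across 4 cores.
sims <- mclapply(1:8, function(i) mean(rnorm(1e6, mean = i)), mc.cores = 4)
unlist(sims)
```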

## Main aspects

• Multithreading is fast but limited to the number of threads per processor.
• MPI is limited by inter-node communication overhead, but clusters can be (and are) built with low inter-node communication cost.
• GPUs are hard to program (CUDA), but are very fast at repeated simple calculations.
• Grids have very slow communication speeds, but can have massive numbers of computers.

Related distributed-data technologies:

• The "father": Google's Big Table
• Hive
• Pig
• HBase
• Many others
• Does it matter?

## Services for distributed computing

• Infrastructure costs are low
• Amazon AWS (rocks)
• Multitude of vendors

## Why we focus on MPI

• Rapidly becoming common
• Accessible implementations (R, Python, C++11)
• Clusters are becoming faster
• It is intuitive and less restrictive

## R libraries for parallel computing

• multicore (very useful for embarrassingly parallel jobs)
• snow, snowfall
• Rmpi

## Embarrassingly parallel computing

• Run several independent jobs in parallel, collect the results
• Parallel sapply (mpi.applyLB)
• What we've been doing on Odyssey

## Rmpi code snippets

```r
# Assumes Rmpi is loaded and workers have been spawned,
# e.g. with library(Rmpi); mpi.spawn.Rslaves().

# mpi.applyLB: load-balanced parallel apply; here rnorm(x[i], mean = 2, sd = 4)
# is evaluated on the workers.
x <- 1:7
mpi.applyLB(x, rnorm, mean = 2, sd = 4)

# Reproduce the same simulation by setting the seed on every worker.
mpi.remote.exec(set.seed(111))
mpi.applyLB(x, rnorm, mean = 2, sd = 4)

# mpi.parApply: parallel version of apply() over an array.
x <- 1:24
dim(x) <- c(2, 3, 4)
mpi.parApply(x, MARGIN = c(1, 2), FUN = mean, job.num = 5)
```

## Non-embarrassingly parallel (NEP) computing

• This is when the worker nodes communicate with each other in the process of executing the jobs.
• Such a scheme is generally more robust, since it can be designed not to depend heavily on the master node.
• It's also lower-level and trickier to implement and debug.

## NEP computing is tricky

• Why is it so? Student discussion.

## Code snippets

```r
mpi.isend(x, type, dest, tag, comm = 1, request = 0)
mpi.irecv(x, type, source, tag, comm = 1, request = 0)
```

These are non-blocking calls.

Basic examples:

```r
# On a worker (slave): send the integer vector 1:10
# (type 1 = integer) to rank 0 (the master) with tag 0.
mpi.send(1:10, 1, 0, 0)

# On the master: post a non-blocking receive into a pre-allocated buffer
# (type 1 = integer) from rank 1 with tag 0.
x <- integer(10)
mpi.irecv(x, 1, 1, 0)
x                      # may still be empty: the receive need not have completed yet
mpi.wait(request = 0)  # block until the pending request completes
x                      # now holds the received values
```

## Parallel statistical computing

• A generic question "how can I speed things up?"
• Starts with "going back to the blackboard"
• Not just MPI

## Parallel likelihood calculation within MCMC

• Document authorship example

\begin{align}l(\alpha, \mu, \pi & \given y) \propto \sum_{w=1}^W\left[y_{dw}\rr{log}\left(\sum_k\mu_{kw}\pi_{dk}\right)\right. \\ & \left. -\sum_k\mu_{kw}\pi_{dk}\right] + \sum_{k=1}^K (\alpha_k - 1)\rr{log}\pi_{dk}\end{align}
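A minimal sketch of how the per-document terms above could be farmed out to Rmpi workers inside each MCMC iteration (same hypothetical y, mu, pi, alpha as before; workers are assumed to be already spawned):

```r
library(Rmpi)

# Log-likelihood contribution of a block of documents (up to a constant).
block_loglik <- function(docs, y, mu, pi, alpha) {
  rates <- pi[docs, , drop = FALSE] %*% mu                 # fitted Poisson rates
  sum(y[docs, , drop = FALSE] * log(rates) - rates) +      # Poisson terms
    sum(log(pi[docs, , drop = FALSE]) %*% (alpha - 1))     # Dirichlet prior terms
}

# Split documents into blocks, one task per block; inside every MCMC
# iteration, the workers return the block sums and the master adds them up.
doc_blocks   <- split(seq_len(nrow(y)), rep(1:4, length.out = nrow(y)))
total_loglik <- sum(unlist(
  mpi.applyLB(doc_blocks, block_loglik, y = y, mu = mu, pi = pi, alpha = alpha)
))
# total_loglik then feeds the Metropolis-Hastings accept/reject decision.
```

Shipping whole blocks rather than single documents keeps the number of messages, and hence the communication overhead, small.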

## Examples of parallelizable algorithms

• Parallel Tempering
• Equi-energy sampler
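As a quick preview of the first item (standard textbook form; the exact variant is covered next lecture): parallel tempering runs one chain per temperature, which maps naturally onto one worker per chain, and communication is only needed for the occasional swap move. A proposed swap between the chains at inverse temperatures \$$\beta_i \$$ and \$$\beta_j \$$, with current states \$$x_i \$$ and \$$x_j \$$ and target density \$$p \$$, is accepted with probability

\begin{align}\min\left\{1, \; \frac{p(x_j)^{\beta_i}\, p(x_i)^{\beta_j}}{p(x_i)^{\beta_i}\, p(x_j)^{\beta_j}}\right\}\end{align}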

### Parallel Tempering

## Other ways to speed things up

• Faster code/languages

## Announcements

• T-shirt competition winner!
• April 4: second final project submission deadline.

## Final slide

• Next lecture: Parallel Tempering - nuts and bolts.