Parallel statistical computing

Stat 221, Lecture 17

@snesterko

Where we are

What is parallel computing

Any algorithm

that can be executed by a number of independent workers (possibly in communication with each other) can be reframed as a parallel computing exercise.

  • Likelihood calculation.
  • MCMC chains.
  • Optimization.
  • Data processing.

Parallelization is usually of one of the following types:

Example: data processing

  • In the document authorship example, determining features and calculating feature counts
  • In the optimization of marketing campaigns example, preparing data for the fitting procedure and optimization, calculating sufficient statistics
  • And so on

Example: likelihood

Document authorship example

  • Document authorship \\( \pi_d \sim \rr{Dirichlet}_K(\vec{\alpha},1) \\).
  • Data \\( y_{dw} \given \pi_d \sim \rr{Poisson}(\sum_k\mu_{kw}\pi_{dk})\\).
  • Author \\( k\\), document \\( d \\), word \\( w \\).
  • \\( \pi_{dk} \\) represents the contribution of author \\( k \\) to document \\(d\\).

Likelihood

For document \\( d \\):

$$\begin{align}L(\alpha, \mu, \pi \given y) \propto & \prod_{w=1}^W\left[\left(\sum_k\mu_{kw}\pi_{dk}\right)^{y_{dw}} \right. \\ & \left.\cdot e^{-\left(\sum_k\mu_{kw}\pi_{dk}\right)}\right] \cdot \prod_{k=1}^K \pi_{dk}^{\alpha_k - 1}\end{align}$$

We really want to compute the log-likelihood, however.

Log likelihood

For document \\( d \\):

$$\begin{align}l(\alpha, \mu, \pi & \given y) \propto \sum_{w=1}^W\left[y_{dw}\rr{log}\left(\sum_k\mu_{kw}\pi_{dk}\right)\right. \\ & \left. -\sum_k\mu_{kw}\pi_{dk}\right] + \sum_{k=1}^K (\alpha_k - 1)\rr{log}\pi_{dk}\end{align}$$

This is just for a single document \\( d \\), so we can parallelize across documents if we have many of them. How do we do that?
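
One way, sketched below under assumed names (not the course code): y is a D x W matrix of counts, mu a K x W matrix of author-specific rates, pim a D x K matrix of authorship weights, and alpha the Dirichlet parameter. Each document's term is computed independently and the pieces are summed with a parallel apply.

# a minimal sketch: per-document log-likelihood, parallelized across documents
library(parallel)

doc_loglik <- function(d, y, mu, pim, alpha) {
  rate <- as.vector(pim[d, ] %*% mu)   # sum_k mu_kw * pi_dk for each word w
  sum(y[d, ] * log(rate) - rate) + sum((alpha - 1) * log(pim[d, ]))
}

# embarrassingly parallel over documents (mpi.applyLB would play the same role)
pieces <- mclapply(seq_len(nrow(y)), doc_loglik,
                   y = y, mu = mu, pim = pim, alpha = alpha, mc.cores = 4)
total_loglik <- sum(unlist(pieces))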

Example: optimization

Optimize decision \\( \delta(X) \\) based on inferred parameters

$$ \int C(\theta, \delta(x)) p(\theta \given y, x) d \theta$$

  • Good when using a Gaussian mixture approximation.
  • The full sampling approach can still be parallelized:
    • Parallel tempering
    • Approximate methods such as variational inference (careful: it is an approximation!)
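
Given posterior draws, the integral above can be approximated by a Monte Carlo average, which is itself embarrassingly parallel over the draws. A minimal sketch with hypothetical names (theta_draws a list of posterior draws of \\( \theta \\), cost a function computing \\( C(\theta, \delta(x)) \\)):

library(parallel)

expected_cost <- function(delta_x, theta_draws, cost, cores = 4) {
  # average the cost over posterior draws, spreading the draws across cores
  costs <- mclapply(theta_draws, function(theta) cost(theta, delta_x),
                    mc.cores = cores)
  mean(unlist(costs))
}

# pick the candidate decision with the smallest estimated expected cost
# best <- candidates[which.min(sapply(candidates, expected_cost, theta_draws, cost))]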

Questions that arise

  • What is the price for parallelization?
  • Specifically, communication speed.
  • Implementation difficulty.
  • Is it worth it?

Types of parallel computing

  • Multithreading
  • MPI
  • GPU
  • Grid computing

Multithreading

MPI

MPI characterized

  • Processes that can send each other messages.
  • There are many functions that facilitate this (MPI_Send, MPI_Recv, etc.).
  • Different languages have wrappers around MPI for ease of use.
  • For example, Rmpi for R.
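
For orientation, a minimal Rmpi session might look like this (assuming a working MPI installation and the Rmpi package; not the only way to set things up):

library(Rmpi)
mpi.spawn.Rslaves(nslaves = 4)              # start 4 R worker (slave) processes
mpi.comm.size()                             # 5: the master plus 4 slaves
mpi.remote.exec(Sys.info()[["nodename"]])   # run a command on every slave
mpi.close.Rslaves()                         # shut the slaves down
mpi.quit()                                  # finalize MPI and exit R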

GPU

  • Good at many simple arithmetic operations.

Grid computing

Comparisons

  • Speed:
    • Multithreading \\(\sim\\) GPU > MPI* > Grid
  • Implementation in R (">" = faster to implement)
    • Multithreading > MPI > \\(\rr{GPU}=\rr{Grid}= \infty \\)

Multithreading is easy with the R package multicore.

*MPI can be fast when using CPU cores as workers.
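
A minimal multicore example (the same mclapply interface is also available in the base parallel package in recent R versions):

library(multicore)
# run 8 independent simulations, up to 4 at a time, and collect the results
res <- mclapply(1:8, function(i) mean(rnorm(1e6)), mc.cores = 4)
unlist(res)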

Main aspects

  • Multithreading is fast but limited by the number of threads per processor.
  • MPI is limited by inter-node communication overhead, but clusters can be (and are) constructed with low inter-node communication cost.
  • GPUs are hard to program (CUDA), but are very fast at repeated simple calculations.
  • Grids have very slow communication speeds, but can have massive numbers of computers.

Where Hadoop comes in

Other Hadoop-like things

  • Its father, Google's BigTable
  • Hive
  • Pig
  • HBase
  • Many others
  • Does it matter?

Services for distributed computing

  • Infrastructure costs are low
  • Amazon AWS (rocks)
  • Multitude of vendors

Why we focus on MPI

  • Rapidly becoming common
  • Accessible implementations (R, Python, C++11)
  • Clusters are becoming faster
  • It is intuitive and less restrictive

R libraries for parallel computing

  • multicore (very useful for embarrassingly parallel jobs)
  • snow, snowfall
  • Rmpi

Embarrassingly parallel computing

  • Run several independent jobs in parallel, collect the results
  • Parallel sapply (mpi.applyLB)
  • What we've been doing on Odyssey

Rmpi code snippets

# mpi.applyLB: load-balanced parallel sapply over the elements of x
x <- 1:7
mpi.applyLB(x, rnorm, mean = 2, sd = 4)

# get the same simulation again by fixing the seed on each slave
mpi.remote.exec(set.seed(111))
mpi.applyLB(x, rnorm, mean = 2, sd = 4)

# mpi.parApply: parallel apply over the margins of an array
x <- 1:24
dim(x) <- c(2, 3, 4)
mpi.parApply(x, MARGIN = c(1, 2), FUN = mean, job.num = 5)

Non-embarrassingly parallel (NEP) computing

  • This is when the worker nodes communicate with each other while executing the jobs.
  • Such a scheme is generally more robust, as it can be designed not to depend heavily on the master node.
  • It is also lower-level and trickier to implement and debug.

NEP computing is tricky

  • Why is it so? Student discussion.

Code snippets

mpi.isend(x, type, dest, tag, comm = 1, request=0)
mpi.irecv(x, type, source, tag, comm = 1, request = 0)

These are non-blocking calls.

Basic examples:

#on a slave: send an integer vector (type 1) to the master (rank 0) with tag 0
mpi.send(1:10, 1, 0, 0)

#on master: post a non-blocking receive into a pre-allocated buffer
x <- integer(10)
mpi.irecv(x, 1, 1, 0)  # type 1 (integer), source 1 (the slave), tag 0
x                      # may still show the empty buffer
mpi.wait()             # block until the receive completes
x                      # now contains 1:10

Parallel statistical computing

  • A generic question: "how can I speed things up?"
  • It starts with "going back to the blackboard"
  • Not just MPI

Parallel likelihood calculation within MCMC

  • Document authorship example

$$\begin{align}l(\alpha, \mu, \pi & \given y) \propto \sum_{w=1}^W\left[y_{dw}\rr{log}\left(\sum_k\mu_{kw}\pi_{dk}\right)\right. \\ & \left. -\sum_k\mu_{kw}\pi_{dk}\right] + \sum_{k=1}^K (\alpha_k - 1)\rr{log}\pi_{dk}\end{align}$$
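
For instance, within a Metropolis step the per-document terms can be farmed out to the workers and summed on the master. A minimal sketch with hypothetical helpers (parallel_loglik wrapping, e.g., mpi.applyLB over documents; propose a symmetric proposal; log_prior the log prior density):

mh_step <- function(theta, y, parallel_loglik, propose, log_prior) {
  theta_star <- propose(theta)   # symmetric proposal assumed
  log_ratio <- parallel_loglik(theta_star, y) + log_prior(theta_star) -
               parallel_loglik(theta, y) - log_prior(theta)
  if (log(runif(1)) < log_ratio) theta_star else theta   # accept or keep current state
}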

Examples of parallelizable algorithms

  • Parallel Tempering
  • Equi-energy sampler

Parallel Tempering
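
As a preview (standard formulation; nuts and bolts next lecture): chain \\( i \\) targets \\( p(\theta \given y)^{\beta_i} \\) for inverse temperatures \\( 1 = \beta_1 > \dots > \beta_m \\), all chains run in parallel, and a proposed swap of the states of chains \\( i \\) and \\( j \\) is accepted with probability

$$ \rr{min}\left\{1,\; \exp\left[(\beta_i - \beta_j)\left(l(\theta_j) - l(\theta_i)\right)\right]\right\} $$

where \\( l \\) denotes the log posterior density. Each chain's update and log-posterior evaluation can run on its own worker.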

Other ways to speed things up

  • Faster code/languages

Announcements

  • T-shirt competition winner!
  • April 4: second final project submission deadline.

Resources

Final slide

  • Next lecture: Parallel Tempering - nuts and bolts.