 MPI for Parallel Tempering

Stat 221, Lecture 19

@snesterko

• MPI explained
• Messages and mailboxes
• Asynchronous (nonblocking) communication
• Parallel Tempering-specific elements
• Random number streams
• Intro to Equi-Energy Sampler
• Example: homework

Model for ranking data based on rater class membership \$$Z_i \$$.

$$Z_i \sim \rr{Multinom}_N \left(\vec{p}_i\right)$$

\begin{align} & l(B \given \{\vec{Z}_i, C_{ijk}\} ) = \rr{const} + \\ & \sum_{i=1}^I \left( \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right] \right) \cdot \vec{Z}_i \end{align}

What could be parallelized?

Pretend the data is big

• Map the data by rater/rater group, calculate the likelihood contribution for each, reduce to the whole likelihood.
• Likelihood needs to be computed many times.
• What would be the speed of that?

Communication speed becomes a big issue: it could be 15x or more slower than computing locally. What parts of the log-posterior fit this structure?
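A minimal sketch of that map-reduce structure on a single multicore node (datByRater and raterLoglik are hypothetical names, not from the lecture code):

## datByRater: a list of per-rater data chunks (hypothetical)
## raterLoglik(chunk, B): that rater's log-likelihood contribution (hypothetical)
library(parallel)

loglik <- function (datByRater, B) {
  ## map: compute each rater's contribution in parallel
  contribs <- mclapply(datByRater, raterLoglik, B = B, mc.cores = 4)
  ## reduce: sum the contributions into the whole log-likelihood
  Reduce(`+`, contribs)
}

On one machine mclapply forks cheaply; once the chunks live on different nodes, every likelihood call pays the communication cost discussed above.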

Zooming out

• To successfully parallelize in a non-GPU setting, we need to look at higher-level parallelism.
• We could, for example, run several chains in parallel and combine/analyze the results.
• Anything else?

MPI

• Non-embarrassingly parallel computing is a happy alternative to slow non-GPU embarrassingly parallel computing: it still lets us gain performance over serial computing.
• MPI facilitates this with a flexible communication protocol.

How to make it parallel

• Run everything on one node (multicore vs no parallelization).
• Can tell worker nodes to run chains.
• Handling replica exchange communication (main thing):
• All through the master node.
• Allow maximum inter-worker communication, independent of the master.
• Pros and cons of each method?

Blocking vs non-blocking design

• Blocking design is when your parallel computing system sits idle waiting for something to occur.
• Non-blocking design is when we minimize unnecessary wait times so that the nodes do what they are supposed to do (compute) and don't sit idle.
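To make the distinction concrete, a minimal Rmpi sketch (assuming a spawned MPI environment): the blocking receive stops the node, while the nonblocking version polls the mailbox and keeps computing.

## blocking: this call does not return until a message arrives,
## so the node sits idle in the meantime
msg <- mpi.recv.Robj(source = mpi.any.source(), tag = mpi.any.tag())

## nonblocking: check the mailbox; if it is empty, go do more work
if (mpi.iprobe(source = mpi.any.source(), tag = mpi.any.tag())) {
  msg <- mpi.recv.Robj(source = mpi.any.source(), tag = mpi.any.tag())
} else {
  ## no message yet -- run another chunk of computation instead of waiting
}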

Non-blocking design

• Want to minimize inter-node communication
• Running identical random number streams on workers.
• Communicate directly with another worker and not through master.

Getting started with MPI

• It can work using cores of your personal computer (fast communication).
• It can work on your cloud computing cluster (Rmpi on Odyssey).
• Working with MPI is about understanding asynchronous message passing between worker nodes.

Starting MPI

library(Rmpi)
master.initializeWorkers <- function (mpistt, lgr = NULL)
{
  ## fix the R exit function
  master.fixLast()

  ## start MPI workers
  mpi.spawn.Rslaves(nslaves = mpistt$nworkers)

  ## all set
  invisible(0)
}

master.initializeWorkers()

Now that we have workers ready

• Time to think about what they should do.
• Go back to the blackboard: in the homework example, let

$$\rr{log}\vec{a}_i = \sum_{j \in J_i} \sum_{k \in K_{ij}} \left[ \vec{x}_{jk}' B - \rr{log} \sum_{l = 1}^{L_j} e^{ \vec{x}_{jl}' B} \right]$$

Need to sample from posterior

With a prior on \$$Z_i \$$:

$$Z_i \sim \rr{Multinom}_N \left(\vec{p}_i\right)$$

\begin{align} \rr{log}p(B, Z \given C_{ijk} ) = & \rr{const} + \\ & \sum_{i=1}^I \left( \rr{log} \vec{p}_i + \rr{log}\vec{a}_i\right) \cdot \vec{Z}_i \end{align}

Definitions for Parallel Tempering

Define \$$\theta = \{B, Z \} \$$, and then, conveniently, \$$E(\theta) = - \rr{log} p(B, Z \given \rr{data}) \$$, and

$$\Pi (X) = \prod_{i=1}^N {e^{- \beta_i E(\theta_i)} \over D(T_i)}$$

with \$$\beta_i = {1 \over T_i }\$$ and \$$D(T_i) = \int e^{- \beta_i E(x_i)} dx_i\$$. The \$$T_i \$$'s are the temperatures of the parallel chains.

The goal

• We'd like to simulate from \$$p(\theta \given \rr{data}) \$$.
• \$$\theta_i \$$ are distinct copies of \$$\theta \$$ running in parallel chains (replicas).
• PT attempts to swap replicas from time to time according to the acceptance probability:

\begin{align} &P_{\rr{replica}} (\theta_i \leftrightarrow \theta_{i+1}) = \\ & = \rr{min} \left( 1, e^{(\beta_i - \beta_{i+1})(E(\theta_i) - E(\theta_{i+1}))}\right)\end{align}

Note: the replica exchange probability is symmetric.

Workers need to know their objects

## broadcast the require/library commands for all needed packages
mpi.bcast.cmd(cmd = library('MCMCpack'), rank = 0, nonblock = FALSE)
mpi.bcast.cmd(cmd = library('rstream'), rank = 0, nonblock = FALSE)
mpi.bcast.cmd(cmd = library('combinat'), rank = 0, nonblock = FALSE)

## broadcast all needed functions (those whose names start
## with lpl, worker, or mpi)
funcNames <- ls(envir = .GlobalEnv)
lplfuns <- funcNames[grep("lpl|worker|mpi", funcNames)]
for (fun in lplfuns) {
  mpi.bcast.NamedObj2slave(obj = eval(as.name(fun)), nam = fun)
}

We are literally populating the little R sessions running on the workers.

Worker random number streams

master.setupRstreams <- function (mpistt, base, lgr = NULL)
{
  ## NOTE: this creates two global variables in each worker:
  ##   stream -- a variable with all random number generators
  ##   rns    -- a list with all generated random number streams
  streams <- sapply(1:mpistt$nworkers, function (ii)
  {
    master.makePackedRstreams(nworkers = mpistt$nworkers,
                              base = base, seed = mpistt$seed,
                              lgr = lgr)
  }, simplify = FALSE)

  ## send random streams to workers
  for (ii in 1:mpistt$nworkers) {
    mpi.send.Robj(obj = streams[[ii]], dest = ii, tag = 10)
  }

  ## broadcast all workers to receive their streams and to generate
  ## the random number streams, stored in the rns object (to be used
  ## heavily later)
  mpi.bcast.cmd({
    stream <- mpi.recv.Robj(source = mpi.any.source(), tag = mpi.any.tag())
    rns <- worker.generateRstreams(nn = stt$niter,
                                   stream = stream, lgr = lgr)
  })

  invisible(0)
}

PT random number streams

• For this, we need to be able to initialize random number streams at the same value and send them to workers.
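Here is the payoff, as a sketch (illustrative function, not from the homework code; sharedStream is an rstream object of the kind built on the next slide): if the two workers in a proposed swap hold identically-seeded streams, each can evaluate the swap decision locally and they are guaranteed to agree, with no extra accept/reject message.

## both workers in a proposed swap run this with identically-seeded streams,
## so they reach the same decision independently
swapAccepted <- function (sharedStream, beta.i, beta.j, E.i, E.j) {
  ## replica exchange acceptance probability from the PT slides above
  p <- min(1, exp((beta.i - beta.j) * (E.i - E.j)))
  ## same stream state on both workers => same uniform => same decision
  u <- rstream.sample(sharedStream, 1)
  u < p
}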

Making rstreams

## rstream is faster than other libraries
library(rstream)
dummy <- new("rstream.mrg32k3a", seed = 1:6)
rstream.sample(dummy, 1)
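One practical detail (a sketch, assuming the packing mechanism of the rstream package, which is presumably what master.makePackedRstreams wraps): stream objects hold external references, so they must be packed before being serialized and shipped to a worker, and unpacked on arrival.

## pack the stream so it survives serialization to another R process
rstream.packed(dummy) <- TRUE
## ... ship it, e.g. mpi.send.Robj(obj = dummy, dest = ii, tag = 10) ...
## unpack on the receiving side before drawing from it
rstream.packed(dummy) <- FALSE
rstream.sample(dummy, 1)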

For details on the rstream library, see the reference paper.

How it all works

(Diagram taken from the reference paper.)

This looks confusing

• The formalism makes it look more complicated than it is.
• It's easier to tell a story of how workers should behave.

Use message passing to communicate between workers.

How mailboxes work

• Sending functions: mpi.send/mpi.isend (and variations)
• Receiving functions: mpi.recv/mpi.irecv
• Check the mailbox: mpi.testany
• Wait for a message: mpi.wait

Example

Worker 1

for (el in elements) {
  mpi.isend(el, type = 2, dest = 2, tag = 1)
}
## tag 2 signals to worker 2 that we are done sending
mpi.isend(0, type = 2, dest = 2, tag = 2)

Worker 2

done <- FALSE
while (!done) {
  x <- 0
  mpi.irecv(x, type = 2, source = mpi.any.source(),
            tag = mpi.any.tag())
  ## block until the nonblocking receive completes
  mpi.wait(request = 0)
  messageinfo <- mpi.get.sourcetag()
  workerid <- messageinfo[1]
  tag      <- messageinfo[2]
  if (tag == 1) print(x) else done <- TRUE
}

Better async sending/receiving

## worker 1
mpi.send(1:10, type = 2, dest = 2, tag = 0)

## worker 2
x <- numeric(10)
mpi.irecv(x, type = 2, source = 1, tag = 0)
x  ## may still be all zeros -- the receive need not have completed yet
mpi.wait(request = 0)
x  ## now guaranteed to hold 1:10

Can send and asynchronously receive whole vectors (unlike the hw5 reference code).

Very useful embarrassingly parallel functions

• mpi.i*apply* (Rmpi Map-Reduce)
• Pset5: mpi.printObj (in lib-student.R)
mpi.printObj <- function (obj, nworkers, simplify = TRUE)
{
  fun <- if (simplify) mpi.iparSapply else mpi.iparLapply
  fun(1:nworkers, function (ee)
  {
    eval(substitute(obj))
  })
}

Can use mpi.printObj to print objects from workers (super useful when debugging).
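For example, assuming the workers have already been spawned and populated, something like this lists the objects living in each worker's session:

## print each worker's global environment
mpi.printObj(obj = ls(envir = .GlobalEnv), nworkers = mpistt$nworkers)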

Going top-down: the master node

## Run the initial pre-loop commands of the algorithm - require, globals
master.initializeAlgo(mpistt = mpistt,
                      objs = list(datform = datform, datorig = datorig,
                                  stt = stt),
                      lgr = lgr)
## Setup the random number streams on the workers -- important to call AFTER
## initializeAlgo because this uses a transferred worker function
master.setupRstreams(mpistt = mpistt, base = rsb, lgr = lgr)

## testing code
#dia <- mpi.printObj(obj = worker(), nworkers = mpistt$nworkers, simplify = FALSE)
#master.printWorkerStatuses(obj = dia, lgr = lgr)

## at this point the workers should be equipped with the needed functions and
## we should be good to go -- time to start the main loop
mpi.bcast.cmd(cmd = { wres <- worker() }, rank = 0, nonblock = FALSE)

## master function
res <- master(mpistt = mpistt, lgr = lgr)

## align the possibly switched temperatures, and form the clean object (draws
## at temperature 1)
resA <- master.alignTemperatures(res = res, lgr = lgr)
clean <- resA[[which(stt$temp == 1)]]

## close slaves and exit
mpi.close.Rslaves()
mpi.quit()

Main functions are giant loops:

master <- function (mpistt, lgr = NULL)
{
  nworkers <- mpistt$nworkers
  closedworkers <- 0
  res <- list()
  ii <- 0
  while (closedworkers < nworkers && ii < mpistt$maxiter) {
    ii <- ii + 1
    # Receive a message from a worker
    message <- mpi.recv.Robj(mpi.any.source(), mpi.any.tag())
    messageinfo <- mpi.get.sourcetag()
    workerid <- messageinfo[1]
    tag      <- messageinfo[2]

    if (tag == 1) {
      ## The worker is giving an update on the progress
      say(lgr, 1, message)
    } else if (tag == 2) {
      ## The worker is done, and the message contains the resulting
      ## object
      say(lgr, 1, "Worker ", workerid, " is done")
      res[[workerid]] <- message
      closedworkers <- closedworkers + 1
    }
  }
  res
}
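For orientation, here is a schematic worker counterpart (an illustrative sketch, not the homework's worker() function). The tags mirror what master() expects: tag 1 for progress updates, tag 2 for the final result, both sent to the master at rank 0.

worker.sketch <- function (niter) {
  draws <- vector("list", niter)
  for (ii in 1:niter) {
    ## ... one sampler iteration (and any replica exchange) would go here ...
    if (ii %% 100 == 0) {
      ## tag 1: progress update, logged by the master via say()
      mpi.send.Robj(obj = paste("at iteration", ii), dest = 0, tag = 1)
    }
  }
  ## tag 2: done -- the message is the resulting object
  mpi.send.Robj(obj = draws, dest = 0, tag = 2)
}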

Need to be careful with

• mpi.irecv and mpi.isend
• These functions are asynchronous!

From worker.irecvObject:

## here define a global variable for the receiving buffer
recvObj <<- 2012.2012
mpi.irecv(x = recvObj, type = 2, source = receiveFromSource,
          tag = mpi.any.tag())

Need to be constantly monitoring the recvObj buffer variable.

Worker irecv code

worker.irecvObject <- function (mpiReq, receiveFromSource, valueTag, lgr = NULL)
{
  ## receive object if there is no unprocessed object currently hanging
  ## around
  if (mpiReq$allowReceiveObj) {
    if (!mpiReq$objRecvInProgress) {
      mpiReq$obj <- list()
    }
    ## ensure to be receiving objects from a single source
    if (!mpiReq$irecvActive) {
      ## here define a global variable for the receiving buffer
      recvObj <<- 2013.2013
      mpi.irecv(x = recvObj, type = 2, source = receiveFromSource,
                tag = mpi.any.tag())
      mpiReq$irecvActive <- TRUE
    }
    ## check if the object has been received. If yes, populate obj with
    ## parsed values
    if (mpiReq$irecvActive) {
      testRecv <- mpi.testany(count = 1)
      mpiReq$currentTag <- -1
      if (testRecv$flag) {
        mpiReq$objRecvInProgress <- TRUE
        msginfo <- mpi.get.sourcetag()
        src <- msginfo[1]
        mpiReq$currentTag <- msginfo[2]
        leninfo <- mpi.get.count(type = 2)
        if (src != SELF) {
          mpiReq$obj$source <- src
          for (nam in names(valueTag)) {
            if (valueTag[[nam]] == mpiReq$currentTag) {
              ## force a copy of the buffer before it gets overwritten
              temp <- recvObj + 0
              mpiReq$obj[[nam]] <- worker.assign(temp)
              break
            }
          }
        }
        mpiReq$irecvActive <- FALSE
      }
    }
  }
  mpiReq
}

Announcements

• T-shirt competition.
• Final project - 2nd checkpoint.

Final slide

• Next lecture: Equi-Energy Sampler.