
Using R on Migale

A short trip through purrr, future and furrr

Mahendra Mariadassou

INRAE - MaIAGE, Migale Team

2020-03-12 (updated: 2020-03-14)


Welcome to the Migale platform!


Prerequisites:

  • An account [1]: request one using the form

  • Working knowledge of Unix [2]

[1] Requires an academic e-mail address

[2] See here for a simple intro



Architecture

Migale is the front node

Which connects you to the computer farm

Tens of nodes (~ machines) representing hundreds of cores (~ CPUs)

Level 1: RStudio server


Connect to the Migale RStudio server

Available at https://rstudio.migale.inrae.fr/


Super easy ☺️ but runs on a Virtual Machine 😞


Level 2: bash mode


Prepare your R script

cat scripts/my_script_1.R
## create <- function(mu = 0) { rnorm(100, mean = mu) }
## analyze <- function(x) { mean(x) }
## results <- numeric(100)
## for (i in 1:100) {
##   results[i] <- analyze(create())
## }
## saveRDS(results, "results.rds")
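
A quick local sanity check before shipping the script to the cluster (a sketch; source() runs the script in the current session):

source("scripts/my_script_1.R")   ## runs the simulations and writes results.rds
results <- readRDS("results.rds") ## read the results back
summary(results)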


Running your script

  1. Transfer scripts and data to migale [1]

  2. Connect to migale

  3. Run script

  4. Work on the results

scp -r . mmariadasso@migale:SOTR/
ssh mmariadasso@migale
Rscript my_script_1.R


To make life easier add

Host migale
  User mmariadasso
  HostName migale.jouy.inra.fr

to your ~/.ssh/config file, so that ssh migale and scp -r . migale:SOTR/ work directly.

Quite easy ☺️ but uses only the front node 😞

[1] You may need to expand migale to migale.jouy.inra.fr


Level 3: Using SGE


About SGE

Sun Grid Engine is a job scheduler: you submit jobs on the front node and SGE dispatches them to the computer farm.

A long introduction to SGE can be found here, but here is a simple example:

RSCRIPT=/usr/local/public/R/bin/Rscript
RJOB="my_script_1.R"
qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bae $RJOB

RSCRIPT and RJOB are shell variables that get expanded in the final call, which thus reads qsub -S /usr/local/public/R/bin/Rscript -q short.q -cwd -V -M me@mail.com -m bae my_script_1.R.

Here we need to specify -S $RSCRIPT to make sure that the instructions in my_script_1.R are executed by R.

12 / 26

SGE options

Let's unpack the options:

  • -cwd run the job in the current working directory
  • -V pass all environment variables to the job
  • -N <jobname> name of the job, as shown by qstat when you check the status of your jobs
  • -b y allow the command to be a binary file instead of a script

Other useful options are:

  • -q <queue> set the queue. See here to choose a queue (short / long / bigmem / etc.) adapted to your needs.
  • -pe thread <n_slots> specify the parallel environment: thread runs a parallel job using shared memory with n_slots cores
  • -R y reserve resources as soon as they become free
  • -o <output_logfile> name of the output log file
  • -e <error_logfile> name of the error log file
  • -m bea send an e-mail when the job begins, ends or aborts
  • -M <emailaddress> e-mail address to send the notifications to

Leveraging the computer farm (I)

We're still only using one node at a time!

Decompose your script and pass arguments

cat scripts/my_script_2.R
## Arguments
args <- commandArgs(trailingOnly = TRUE)
id <- as.integer(args[1])
## Computations
create <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
result <- analyze(create())
## Results
saveRDS(object = result, file = paste0("result_", id, ".rds"))

Leveraging the computer farm (II)

Use qsub repeatedly

RSCRIPT="/usr/local/public/R/bin/Rscript"
RJOB="my_script_2.R"
QSUB="qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bae"
seq 1 100 | xargs -I {} $QSUB $RJOB {}

This is equivalent to

$QSUB $RJOB 1
...
$QSUB $RJOB 100

Aggregate all the results at the end:

results <- numeric(100)
for (i in 1:100) results[i] <- readRDS(paste0("result_", i, ".rds"))
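
If a job failed, the corresponding file is missing and readRDS() will error. A small guard (a sketch in base R) makes the aggregation more robust:

## Sketch: check that all result files are present before reading them
files <- paste0("result_", 1:100, ".rds")
stopifnot(all(file.exists(files)))
results <- vapply(files, readRDS, numeric(1))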

Monitoring your jobs

Use qstat on migale to monitor the state of your jobs: qw (waiting), Eqw (error), t (transferring), r (running)

Some pros and cons

➕ Quite easy if you want to parallelize loops (simulations)

➕ uses many machines (!= many cores on a machine)

➖ Back and forth between R and bash

➖ Not perfect for numerical experiments (the machines are not perfectly identical)


Level 4: Using future


The future / future.batchtools packages

  1. future allows you to call SGE directly from R and suppresses the back and forth

  2. future is quite general and can handle many back-ends

  3. You need to specify the back-end with plan. Here are some examples:

library(future)
library(future.batchtools)
plan(sequential)     ## R as usual
plan(multiprocess)   ## use many cores on the same machine
plan(batchtools_sge) ## use SGE via the future.batchtools package
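
Once a plan is set, the same code runs unchanged on the chosen back-end. A minimal sketch using the core future API (future() and value()):

f <- future(mean(rnorm(1e6))) ## evaluated according to the current plan
value(f)                      ## blocks until the result is available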

But first you need to set up a configuration file.

cat ~/.batchtools.sge.tmpl ## on Migale
#!/bin/bash
## The name of the job
#$ -N <%= job.name %>
## Combining output/error messages into one file
#$ -j y
## Giving the name of the output log file
#$ -o <%= log.file %>
## Execute the script in the working directory
#$ -cwd
## Use environment variables
#$ -V
## Use multithreading
#$ -pe threads <%= resources$threads %>
## Use correct queue
#$ -q <%= resources$queue %>
## Export value of DEBUGME environment var to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>
<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0

Time to make a plan

library(future.batchtools)
plan(batchtools_sge,
     workers = 10, ## maximum number of parallel jobs,
                   ## unlimited by default
     template = "~/.batchtools.sge.tmpl", ## SGE template, not needed here
                                          ## because it is read by default
     resources = list( ## parameters defined on the fly
       queue = "short.q", ## queue to use
       threads = 1        ## number of cores per node
     )
)


Working with future

A simple drop-in replacement in most scripts:

  • replace vector and/or list with listenv
  • replace <- with %<-%
  • transform listenv to list and/or vector
library(listenv)
## Set up a listenv (a special kind of list)
results <- listenv()
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
for (i in 1:10) {
  results[[i]] %<-% analyze(create())
}
results <- unlist(as.list(results)) ## blocks until all
                                    ## computations are done
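
Equivalently, you can create the futures explicitly instead of using the %<-% operator; a minimal sketch with future() and value():

futures <- lapply(1:10, function(i) future(analyze(create())))
results <- vapply(futures, value, numeric(1)) ## collect the results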

Level 5: Using furrr

furrr = purrr + future


purrr: Iteration made easy

  • The map() and map_*() family of functions elegantly replaces loops.

Writing our previous example with purrr gives:

library(purrr); library(dplyr)
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i = 1:10,
  mu = rep(0, length(i)),
  result = map_dbl(mu, ~ analyze(create(.x)))
)
results
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0 -0.0247
##  2     2     0  0.0197
##  3     3     0  0.0222
##  4     4     0 -0.0295
##  5     5     0 -0.0391
##  6     6     0 -0.0323
##  7     7     0  0.0139
##  8     8     0 -0.0327
##  9     9     0  0.0485
## 10    10     0 -0.00247
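
The same pattern extends to varying parameters; a sketch with a hypothetical grid of mu values, reusing create() and analyze() from above:

results <- tibble(
  mu = rep(c(0, 1, 2), each = 5), ## hypothetical parameter grid
  result = map_dbl(mu, ~ analyze(create(.x)))
)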

Furrr: when future meets purrr

  1. furrr provides future_map_*() as drop-in alternatives to the map_*() functions.

  2. You just need to have a plan (as with future)

library(furrr)
library(purrr)
library(dplyr)
plan(multiprocess) ## or plan(batchtools_sge)
## Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
## in future (>= 1.13.0) when running R from RStudio, because it is
## considered unstable. Because of this, plan("multicore") will fall
## back to plan("sequential"), and plan("multiprocess") will fall back to
## plan("multisession") - not plan("multicore") as in the past. For more details,
## how to control forked processing or not, and how to silence this warning in
## future R sessions, see ?future::supportsMulticore
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i = 1:10,
  mu = rep(0, length(i)),
  result = future_map_dbl(mu, ~ analyze(create(.x)))
)
results
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0  0.0234
##  2     2     0 -0.0120
##  3     3     0 -0.0287
##  4     4     0 -0.00791
##  5     5     0  0.0134
##  6     6     0  0.0167
##  7     7     0  0.0132
##  8     8     0  0.00816
##  9     9     0  0.0147
## 10    10     0 -0.0165
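
To check that the parallel back-end actually kicks in, you can compare wall-clock times. A rough sketch with an artificially slowed-down analysis (slow_analyze() is hypothetical):

slow_analyze <- function(x) { Sys.sleep(1); mean(x) } ## artificially slow
system.time(map_dbl(1:8, ~ slow_analyze(create())))        ## roughly 8 s, sequential
system.time(future_map_dbl(1:8, ~ slow_analyze(create()))) ## much less with several workers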

Going further

You can produce back-ends that spawn multiple jobs, each of which uses multiple cores.

plan(list(
  tweak(batchtools_sge, resources = list(queue = "short.q", threads = 10)),
  tweak(multiprocess, workers = 10)
))

Note the workers option for multiprocess:

  • This is good practice.

  • Always specify the number of workers to use when going multiprocess. Otherwise, R will use all available cores and other people will hate you 😡
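
With this nested plan, an outer future_map() call dispatches SGE jobs while the inner calls use the cores of each node; a minimal sketch reusing create() and analyze() from before:

library(furrr)
results <- future_map(1:10, function(job) { ## 10 SGE jobs...
  future_map_dbl(1:10, ~ analyze(create())) ## ...each running 10 analyses in parallel
})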

