
Using R on Migale

A short trip through purrr, future and furrr

Mahendra Mariadassou

INRAE - MaIAGE, Migale Team

2020-03-12 (updated: 2020-03-14)


Welcome to the Migale platform!


Prerequisites:

  • An account [1]: request one using the form

  • Working knowledge of Unix [2]

[1] Requires an academic e-mail address

[2] See here for a simple intro



Architecture

Migale is the front node

Which connects you to the computer farm

Tens of nodes (~ machines) representing hundreds of cores (~ CPUs)

Level 1: RStudio server


Connect to the Migale RStudio server

Available at https://rstudio.migale.inrae.fr/


Super easy ☺️ but runs on a Virtual Machine 😞


Level 2: bash mode


Prepare your R script

cat scripts/my_script_1.R
## create <- function(mu = 0) { rnorm(100, mean = mu) }
## analyze <- function(x) { mean(x) }
## results <- numeric(100)
## for (i in 1:100) {
##   results[i] <- analyze(create())
## }
## saveRDS(results, "results.rds")
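
A quick local sanity check before shipping the script to the cluster (a sketch; source() runs the script in the current session):

source("scripts/my_script_1.R")   ## runs the simulations and writes results.rds
results <- readRDS("results.rds") ## read the results back
summary(results)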


Running your script

  1. Transfer scripts and data to migale [1]

  2. Connect to migale

  3. Run script

  4. Work on the results

scp -r . mmariadasso@migale:SOTR/
ssh mmariadasso@migale
Rscript my_script_1.R


To make life easier add

Host migale
  User mmariadasso
  HostName migale.jouy.inra.fr

to your ~/.ssh/config file, so that ssh migale and scp -r . migale:SOTR/ work directly.

Quite easy ☺️ but uses only the front node 😞

[1] You may need to expand migale to migale.jouy.inra.fr


Level 3: Using SGE


About SGE

Sun Grid Engine is a job scheduler: you submit jobs on the front node and SGE dispatches them to the computer farm.

A long introduction to SGE can be found here, but here is a simple example:

RSCRIPT=/usr/local/public/R/bin/Rscript
RJOB="my_script_1.R"
qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bae $RJOB

RSCRIPT and RJOB are shell variables that get expanded in the final call, which thus reads qsub -S /usr/local/public/R/bin/Rscript -q short.q -cwd -V -M me@mail.com -m bae my_script_1.R.

Here we need to specify -S $RSCRIPT to make sure that the instructions in my_script_1.R are executed by R.

12 / 26

SGE options

Let's unpack the options:

  • -cwd run the job in the current working directory
  • -V pass all environment variables to the job
  • -N <jobname> name of the job, as shown by qstat when you check the status of your jobs
  • -b y allow the command to be a binary file instead of a script

Other useful options are:

  • -q <queue> set the queue. See here to choose a queue (short / long / bigmem / etc.) adapted to your needs.
  • -pe thread <n_slots> specify the parallel environment: thread runs a parallel job using shared memory with n_slots cores
  • -R y reserve resources as soon as they become free
  • -o <output_logfile> name of the output log file
  • -e <error_logfile> name of the error log file
  • -m bea send an e-mail when the job begins, ends or aborts
  • -M <emailaddress> e-mail address to send the notifications to

Leveraging the computer farm (I)

We're still only using one node at a time!

Decompose your script and pass arguments

cat scripts/my_script_2.R
## Arguments
args <- commandArgs(trailingOnly = TRUE)
id <- as.integer(args[1])
## Computations
create <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
result <- analyze(create())
## Results
saveRDS(object = result, file = paste0("result_", id, ".rds"))

Leveraging the computer farm (II)

Use qsub repeatedly

RSCRIPT="/usr/local/public/R/bin/Rscript"
RJOB="my_script_2.R"
QSUB="qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bae"
seq 1 100 | xargs -I {} $QSUB $RJOB {}

This is equivalent to

$QSUB $RJOB 1
...
$QSUB $RJOB 100

Aggregate all the results at the end:

results <- numeric(100)
for (i in 1:100) results[i] <- readRDS(paste0("result_", i, ".rds"))
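
If a job failed, the corresponding file is missing and readRDS() will error. A small guard (a sketch in base R) makes the aggregation more robust:

## Sketch: check that all result files are present before reading them
files <- paste0("result_", 1:100, ".rds")
stopifnot(all(file.exists(files)))
results <- vapply(files, readRDS, numeric(1))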

Monitoring your jobs

Use qstat on migale to monitor the state of your jobs: qw (waiting), Eqw (error), t (transferring), r (running)

Some pros and cons

➕ Quite easy if you want to parallelize loops (simulations)

➕ uses many machines (!= many cores on a machine)

➖ Back and forth between R and bash

➖ Not perfect for numerical experiments (the machines are not perfectly identical)


Level 4: Using future


The future / future.batchtools packages

  1. future allows you to call SGE directly from R and suppresses the back and forth

  2. future is quite general and can handle many back-ends

  3. You need to specify the back-end with plan. Here are some examples:

library(future)
library(future.batchtools)
plan(sequential)     ## R as usual
plan(multiprocess)   ## use many cores on the same machine
plan(batchtools_sge) ## use SGE via the future.batchtools package
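
Once a plan is set, the same code runs unchanged on the chosen back-end. A minimal sketch using the core future API (future() and value()):

f <- future(mean(rnorm(1e6))) ## evaluated according to the current plan
value(f)                      ## blocks until the result is available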

But first you need to set up a configuration file.

cat ~/.batchtools.sge.tmpl ## on Migale
#!/bin/bash
## The name of the job
#$ -N <%= job.name %>
## Combining output/error messages into one file
#$ -j y
## Giving the name of the output log file
#$ -o <%= log.file %>
## Execute the script in the working directory
#$ -cwd
## Use environment variables
#$ -V
## Use multithreading
#$ -pe threads <%= resources$threads %>
## Use correct queue
#$ -q <%= resources$queue %>
## Export value of DEBUGME environment var to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>
<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0

Time to make a plan

library(future.batchtools)
plan(batchtools_sge,
     workers = 10, ## maximum number of parallel jobs,
                   ## unlimited by default
     template = "~/.batchtools.sge.tmpl", ## SGE template, not needed here
                                          ## because it is read by default
     resources = list( ## parameters defined on the fly
       queue = "short.q", ## queue to use
       threads = 1        ## number of cores per node
     )
)


Working with future

A simple drop-in replacement in most scripts:

  • replace vector and/or list with listenv
  • replace <- with %<-%
  • transform listenv to list and/or vector
library(listenv)
## Set up a listenv (a special kind of list)
results <- listenv()
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
for (i in 1:10) {
  results[[i]] %<-% analyze(create())
}
results <- unlist(as.list(results)) ## blocks until all
                                    ## computations are done
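
Equivalently, you can create the futures explicitly instead of using the %<-% operator; a minimal sketch with future() and value():

futures <- lapply(1:10, function(i) future(analyze(create())))
results <- vapply(futures, value, numeric(1)) ## collect the results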

Level 5: Using furrr

furrr = purrr + future


purrr: Iteration made easy

  • The map() and map_*() family of functions elegantly replaces loops.

Writing our previous example with purrr gives:

library(purrr); library(dplyr)
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i = 1:10,
  mu = rep(0, length(i)),
  result = map_dbl(mu, ~ analyze(create(.x)))
)
results
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0 -0.0247
##  2     2     0  0.0197
##  3     3     0  0.0222
##  4     4     0 -0.0295
##  5     5     0 -0.0391
##  6     6     0 -0.0323
##  7     7     0  0.0139
##  8     8     0 -0.0327
##  9     9     0  0.0485
## 10    10     0 -0.00247
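
The same pattern extends to varying parameters; a sketch with a hypothetical grid of mu values, reusing create() and analyze() from above:

results <- tibble(
  mu = rep(c(0, 1, 2), each = 5), ## hypothetical parameter grid
  result = map_dbl(mu, ~ analyze(create(.x)))
)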

Furrr: when future meets purrr

  1. furrr provides future_map_*() as drop-in alternatives to the map_*() functions.

  2. You just need to have a plan (as with future)

library(furrr)
library(purrr)
library(dplyr)
plan(multiprocess) ## or plan(batchtools_sge)
## Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
## in future (>= 1.13.0) when running R from RStudio, because it is
## considered unstable. Because of this, plan("multicore") will fall
## back to plan("sequential"), and plan("multiprocess") will fall back to
## plan("multisession") - not plan("multicore") as in the past. For more details,
## how to control forked processing or not, and how to silence this warning in
## future R sessions, see ?future::supportsMulticore
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i = 1:10,
  mu = rep(0, length(i)),
  result = future_map_dbl(mu, ~ analyze(create(.x)))
)
results
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0  0.0234
##  2     2     0 -0.0120
##  3     3     0 -0.0287
##  4     4     0 -0.00791
##  5     5     0  0.0134
##  6     6     0  0.0167
##  7     7     0  0.0132
##  8     8     0  0.00816
##  9     9     0  0.0147
## 10    10     0 -0.0165
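
To check that the parallel back-end actually kicks in, you can compare wall-clock times. A rough sketch with an artificially slowed-down analysis (slow_analyze() is hypothetical):

slow_analyze <- function(x) { Sys.sleep(1); mean(x) } ## artificially slow
system.time(map_dbl(1:8, ~ slow_analyze(create())))        ## roughly 8 s, sequential
system.time(future_map_dbl(1:8, ~ slow_analyze(create()))) ## much less with several workers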

Going further

You can produce back-ends that spawn multiple jobs, each of which uses multiple cores.

plan(list(
  tweak(batchtools_sge, resources = list(queue = "short.q", threads = 10)),
  tweak(multiprocess, workers = 10)
))

Note the workers option for multiprocess:

  • This is good practice.

  • Always specify the number of workers to use when going multiprocess. Otherwise, R will use all available cores and other people will hate you 😡
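
With this nested plan, an outer future_map() call dispatches SGE jobs while the inner calls use the cores of each node; a minimal sketch reusing create() and analyze() from before:

library(furrr)
results <- future_map(1:10, function(job) { ## 10 SGE jobs...
  future_map_dbl(1:10, ~ analyze(create())) ## ...each running 10 analyses in parallel
})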

