--
create <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
results <- numeric(100)
for (i in 1:100) {
  results[i] <- analyze(create())
}
saveRDS(results, "results.rds")
cat scripts/my_script_1.R
## create <- function(mu = 0) { rnorm(100, mean = mu) }
## analyze <- function(x) { mean(x) }
## results <- numeric(100)
## for (i in 1:100) {
##   results[i] <- analyze(create())
## }
## saveRDS(results, "results.rds")
Transfer scripts and data to migale [1]
Connect to migale
Run script
Work on the results
scp -r . mmariadasso@migale:SOTR/
ssh mmariadasso@migale
Rscript my_script_1.R
--
To make life easier, add

Host migale
    User mmariadasso
    HostName migale.jouy.inra.fr

to your ~/.ssh/config file.
[1] You may need to expand migale to migale.jouy.inra.fr
Sun Grid Engine (SGE) is a job scheduler: you submit jobs on the front node and SGE dispatches them to the computer farm.
A long introduction to SGE can be found here, but here is a simple example:
RSCRIPT=/usr/local/public/R/bin/Rscript
RJOB="my_script_1.R"
qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bae $RJOB
RSCRIPT and RJOB are shell variables and are expanded in the final call.
Here we need to specify -S $RSCRIPT to make sure that the instructions in my_script_1.R are executed with R.
Let's unpack the options:
-cwd : run the job in the current working directory
-V : pass all environment variables to the job
-N <jobname> : name of the job, shown when you use qstat to check the status of your jobs
-b y : allow the command to be a binary file instead of a script
Other useful options are:
-q <queue> : set the queue. See here to choose a queue (short / long / bigmem / etc.) adapted to your needs
-pe thread <n_slots> : specify the parallel environment; thread runs a parallel job using shared memory and <n_slots> cores
-R y : reserve resources as soon as they are free
-o <output_logfile> : name of the output log file
-e <error_logfile> : name of the error log file
-m bea : send an email when the job begins, ends or aborts
-M <emailaddress> : email address to send the emails to
We're still only using one node at a time!
cat scripts/my_script_2.R
## Arguments
args <- commandArgs(trailingOnly = TRUE)
id <- as.integer(args[1])

## Computations
create <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
result <- analyze(create())

## Results
saveRDS(object = result, file = paste0("result_", id, ".rds"))
Call qsub repeatedly:

RSCRIPT="/usr/local/public/R/bin/Rscript"
RJOB="my_script_2.R"
QSUB="qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bae"
seq 1 100 | xargs -I {} $QSUB $RJOB {}
This is equivalent to

$QSUB $RJOB 1
...
$QSUB $RJOB 100
results <- numeric(100)
for (i in 1:100) results[i] <- readRDS(paste0("result_", i, ".rds"))
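If some jobs failed or are still queued, the loop above will error on the first missing file. Here is a minimal sketch (assuming the result_<id>.rds naming convention from my_script_2.R) that checks for missing results before collecting:

files <- paste0("result_", 1:100, ".rds")
missing <- files[!file.exists(files)] ## jobs that have not written results yet
if (length(missing) > 0) {
  stop("Missing result files: ", paste(missing, collapse = ", "))
}
results <- vapply(files, readRDS, numeric(1)) ## collect into a numeric vector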
Use qstat on migale to monitor the state of your jobs: qw (waiting), Eqw (error), t (transferring), r (running)
➕ Quite easy if you want to parallelize loops (simulations)
➕ Uses many machines (!= many cores on a single machine)
➖ Back and forth between R and bash
➖ Not perfect for numerical experiments (machines are not perfectly similar)
future

future allows you to call SGE directly from R and removes the back and forth.

future is quite general and can handle many back-ends.

You need to specify the back-end with plan. Here are some examples:
library(future)
library(future.batchtools)
plan(sequential)     ## R as usual
plan(multiprocess)   ## Use many cores on the same machine
plan(batchtools_sge) ## Use SGE via the future.batchtools package
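To make the semantics concrete, here is a minimal sketch (using plan(multisession), which works on any machine; the variable x is just for illustration): the future is evaluated in a background R session, and reading x blocks only until that evaluation finishes.

library(future)
plan(multisession, workers = 2) ## two background R sessions
x %<-% {
  Sys.sleep(1) ## some long computation
  Sys.getpid() ## pid of the worker that ran it
}
Sys.getpid() ## the main session is not blocked meanwhile
x            ## reading x waits for the future; differs from the pid above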
But first you need to set up a configuration file.
cat ~/.batchtools.sge.tmpl ## on migale
#!/bin/bash

## The name of the job
#$ -N <%= job.name %>

## Combining output/error messages into one file
#$ -j y

## Giving the name of the output log file
#$ -o <%= log.file %>

## Execute the script in the working directory
#$ -cwd

## Use environment variables
#$ -V

## Use multithreading
#$ -pe threads <%= resources$threads %>

## Use correct queue
#$ -q <%= resources$queue %>

## Export value of DEBUGME environment var to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0
library(future.batchtools)
plan(batchtools_sge,
     workers = 10,                        ## maximum number of parallel jobs,
                                          ## unlimited by default
     template = "~/.batchtools.sge.tmpl", ## SGE template, not needed here
                                          ## since it is read by default
     resources = list(                    ## parameters set on the fly
       queue = "short.q",                 ## queue to use
       threads = 1                        ## number of cores per node
     ))
Simple drop-in in most scripts:

- Replace vector and/or list with listenv
- Replace <- with %<-%
- Convert the listenv back to a list and/or vector
library(listenv)
## Setup a listenv (special kind of list)
results <- listenv()
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
for (i in 1:10) {
  results[[i]] %<-% analyze(create())
}
results <- unlist(as.list(results)) ## stalled until the end of
                                    ## all computations
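A related sketch (the names res and f are only for illustration): future::futureOf() retrieves the future behind a listenv element, so you can poll it with resolved() instead of blocking.

library(future)
library(listenv)
res <- listenv()
for (i in 1:3) {
  res[[i]] %<-% mean(rnorm(1000))
}
f <- futureOf(res[[1]]) ## the future behind the first element
resolved(f)             ## TRUE once that computation has finished
value(f)                ## blocks, then returns the result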
furrr

furrr = purrr + future

purrr: iteration made easy. The map and map_* family of functions superbly replace loops. Writing our previous example with purrr would give:
library(purrr); library(dplyr)
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i = 1:10,
  mu = rep(0, length(i)),
  result = map_dbl(mu, ~ analyze(create(.x)))
)
results
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0 -0.0247
##  2     2     0  0.0197
##  3     3     0  0.0222
##  4     4     0 -0.0295
##  5     5     0 -0.0391
##  6     6     0 -0.0323
##  7     7     0  0.0139
##  8     8     0 -0.0327
##  9     9     0  0.0485
## 10    10     0 -0.00247
furrr: when future meets purrr

furrr provides future_map_*() as drop-in alternatives to the map_*() functions.
You just need to have a plan (as with future).
library(furrr)
library(purrr)
library(dplyr)
plan(multiprocess) ## Or plan(batchtools_sge)
## Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
## in future (>= 1.13.0) when running R from RStudio, because it is
## considered unstable. Because of this, plan("multicore") will fall
## back to plan("sequential"), and plan("multiprocess") will fall back to
## plan("multisession") - not plan("multicore") as in the past. For more details,
## how to control forked processing or not, and how to silence this warning in
## future R sessions, see ?future::supportsMulticore
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i = 1:10,
  mu = rep(0, length(i)),
  result = future_map_dbl(mu, ~ analyze(create(.x)))
)
results
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0  0.0234
##  2     2     0 -0.0120
##  3     3     0 -0.0287
##  4     4     0 -0.00791
##  5     5     0  0.0134
##  6     6     0  0.0167
##  7     7     0  0.0132
##  8     8     0  0.00816
##  9     9     0  0.0147
## 10    10     0 -0.0165
You can produce back-ends that spawn multiple jobs, each of which uses multiple cores.
plan(list(
  tweak(batchtools_sge, resources = list(queue = "short.q", threads = 10)),
  tweak(multiprocess, workers = 10)
))
Note the workers option for multiprocess.
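As an illustration (this sketch is not from the deck; the sizes simply mirror the plan above), the outer future_map() would dispatch one SGE job per element while the inner future_map_dbl() uses the cores reserved on each node:

library(furrr)
results <- future_map(c(0, 1, 2), function(mu) { ## outer level: one SGE job per mu
  future_map_dbl(1:10, function(rep) {           ## inner level: cores of the node
    mean(rnorm(1000, mean = mu))
  })
})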
This is good practice: manually specify the number of workers to use when going multiprocess. Otherwise, R will use all available cores and other people will hate you 😡
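A minimal sketch of doing so (leaving one core free is just one sensible convention): future::availableCores() respects scheduler and cgroup limits, so it is safer than parallel::detectCores().

library(future)
n_workers <- max(1, availableCores() - 1) ## leave one core free for others
plan(multisession, workers = n_workers)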