create <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
results <- numeric(100)
for (i in 1:100) {
  results[i] <- analyze(create())
}
saveRDS(results, "results.rds")
cat scripts/my_script_1.R
## create <- function(mu = 0) { rnorm(100, mean = mu) }
## analyze <- function(x) { mean(x) }
## results <- numeric(100)
## for (i in 1:100) {
##   results[i] <- analyze(create())
## }
## saveRDS(results, "results.rds")

1. Transfer scripts and data to migale [1]
2. Connect to migale
3. Run script
4. Work on the results
scp -r . mmariadasso@migale:SOTR/
ssh mmariadasso@migale
Rscript my_script_1.R
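Step 4 then happens back on your own machine. A minimal sketch, assuming you have copied results.rds back (e.g. with scp mmariadasso@migale:SOTR/results.rds .):

results <- readRDS("results.rds")  ## the 100 means computed on migale
summary(results)                   ## quick sanity check
hist(results)                      ## distribution of the means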
To make life easier add
Host migale
    User mmariadasso
    HostName migale.jouy.inra.fr
to your ~/.ssh/config file.
[1] You may need to expand migale to migale.jouy.inra.fr
Sun Grid Engine (SGE) is a job scheduler: you submit your jobs on the front node and SGE dispatches them to the computer farm.
A long introduction to SGE can be found here, but here is a simple example:
RSCRIPT=/usr/local/public/R/bin/Rscript
RJOB="my_script_1.R"
qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bae $RJOB
RSCRIPT and RJOB are shell variables and are expanded in the final call.
Here we need to specify -S $RSCRIPT to make sure that the instructions in my_script_1.R are executed with R.
Let's unpack the options:
- -cwd: run in the current working directory
- -V: pass all environment variables to the job
- -N <jobname>: name of the job; this is what you will see when you use qstat to check the status of your jobs
- -b y: allow the command to be a binary file instead of a script

Other useful options are:
- -q <queue>: set the queue. See here to choose a queue (short / long / bigmem / etc.) adapted to your needs
- -pe thread <n_slots>: specify the parallel environment; thread runs a parallel job using shared memory and <n_slots> cores
- -R y: reserve resources as soon as they are free
- -o <output_logfile>: name of the output log file
- -e <error_logfile>: name of the error log file
- -m bea: send an email when the job begins, ends or aborts
- -M <emailaddress>: email address to send those notifications to

We're still only using one node at a time!!
cat scripts/my_script_2.R
## Arguments
args <- commandArgs(trailingOnly = TRUE)
id <- as.integer(args[1])

## Computations
create <- function(mu = 0) { rnorm(100, mean = mu) }
analyze <- function(x) { mean(x) }
result <- analyze(create())

## Results
saveRDS(object = result, file = paste0("result_", id, ".rds"))

qsub repeatedly

RSCRIPT="/usr/local/public/R/bin/Rscript"
RJOB="my_script_2.R"
QSUB="qsub -S $RSCRIPT -q short.q -cwd -V -M me@mail.com -m bae"
seq 1 100 | xargs -I {} $QSUB $RJOB {}
This is equivalent to
$QSUB $RJOB 1
...
$QSUB $RJOB 100
results <- numeric(100)
for (i in 1:100) results[i] <- readRDS(paste0("result_", i, ".rds"))

Use qstat on migale to monitor the state of your jobs: qw (waiting), Eqw (error), t (transferring), r (running).
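Jobs sometimes abort or stay stuck in Eqw, so it can be worth checking that all result files exist before collecting them. A hedged sketch, not part of the original scripts:

expected <- paste0("result_", 1:100, ".rds")
missing  <- expected[!file.exists(expected)]
if (length(missing) > 0) {
  warning("missing result files: ", paste(missing, collapse = ", "))
}
## Read only the files that were actually produced
results <- vapply(expected[file.exists(expected)], readRDS, numeric(1))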
➕ Quite easy if you want to parallelize loops (simulations)
➕ uses many machines (!= many cores on a machine)
➖ Back and forth between R and bash
➖ Not perfect for numerical experiments (machines are not perfectly similar)
future

future allows you to call SGE directly from R and removes the back and forth.
future is quite general and can handle many back-ends
You need to specify the back-end with plan. Here are some examples:
library(future)
library(future.batchtools)
plan(sequential)     ## R as usual
plan(multiprocess)   ## Use many cores on the same machine
plan(batchtools_sge) ## Use SGE via the future.batchtools package
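To see what a plan changes, here is a tiny sketch (not from the original deck) comparing process IDs under plan(multisession), the backend that multiprocess falls back to in recent versions of future:

library(future)
plan(multisession)           ## workers are separate R processes
main_pid <- Sys.getpid()     ## PID of this R session
worker_pid %<-% Sys.getpid() ## evaluated asynchronously in a worker
c(main = main_pid, worker = worker_pid) ## two different PIDs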
But first you need to set up a configuration file.
cat ~/.batchtools.sge.tmpl ## on Migale
#!/bin/bash

## The name of the job
#$ -N <%= job.name %>

## Combining output/error messages into one file
#$ -j y

## Giving the name of the output log file
#$ -o <%= log.file %>

## Execute the script in the working directory
#$ -cwd

## Use environment variables
#$ -V

## Use multithreading
#$ -pe threads <%= resources$threads %>

## Use correct queue
#$ -q <%= resources$queue %>

## Export value of DEBUGME environment variable to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>
<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0
library(future.batchtools)
plan(batchtools_sge,
     workers = 10, ## maximum number of parallel jobs,
                   ## unlimited by default
     template = "~/.batchtools.sge.tmpl", ## SGE template, not needed here
                                          ## since it is read by default
     resources = list( ## parameters set on the fly
       queue = "short.q", ## queue to use
       threads = 1        ## number of cores per node
     )
)
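With this plan set, every future created in the session is submitted to short.q as its own SGE job. A minimal sketch, reusing the toy functions from before:

create  <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
result %<-% analyze(create()) ## submitted as an SGE job
result                        ## blocks until the job has finished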
Simple drop-in in most scripts:

- replace vector and/or list with listenv
- replace <- with %<-%
- convert the listenv back to list and/or vector

library(listenv)
## Setup a listenv (special kind of list)
results <- listenv()
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
for (i in 1:10) {
  results[[i]] %<-% analyze(create())
}
results <- unlist(as.list(results)) ## stalled until the end of
                                    ## all computations

furrr

furrr = purrr + future

purrr: iteration made easy

The map and map_* family of functions superbly replace loops. Writing our previous example with purrr would give:
library(purrr); library(dplyr)
create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i = 1:10,
  mu = rep(0, length(i)),
  result = map_dbl(mu, ~ analyze(create(.x)))
)
results
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0 -0.0247
##  2     2     0  0.0197
##  3     3     0  0.0222
##  4     4     0 -0.0295
##  5     5     0 -0.0391
##  6     6     0 -0.0323
##  7     7     0  0.0139
##  8     8     0 -0.0327
##  9     9     0  0.0485
## 10    10     0 -0.00247

Furrr: when future meets purrr

Furrr provides future_map_*() as drop-in alternatives to map_*() functions.
You just need to have a plan (as with future)
library(furrr)
library(purrr)
library(dplyr)
plan(multiprocess) ## Or plan(batchtools_sge)
## Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
## in future (>= 1.13.0) when running R from RStudio, because it is
## considered unstable. Because of this, plan("multicore") will fall
## back to plan("sequential"), and plan("multiprocess") will fall back to
## plan("multisession") - not plan("multicore") as in the past. For more details,
## how to control forked processing or not, and how to silence this warning in
## future R sessions, see ?future::supportsMulticore

create <- function(mu = 0) { rnorm(1000, mean = mu) }
analyze <- function(x) { mean(x) }
results <- tibble(
  i = 1:10,
  mu = rep(0, length(i)),
  result = future_map_dbl(mu, ~ analyze(create(.x)))
)
results
## # A tibble: 10 x 3
##        i    mu   result
##    <int> <dbl>    <dbl>
##  1     1     0  0.0234
##  2     2     0 -0.0120
##  3     3     0 -0.0287
##  4     4     0 -0.00791
##  5     5     0  0.0134
##  6     6     0  0.0167
##  7     7     0  0.0132
##  8     8     0  0.00816
##  9     9     0  0.0147
## 10    10     0 -0.0165

You can produce back-ends that spawn multiple jobs, each of which uses multiple cores.
plan(list(
  tweak(batchtools_sge, resources = list(queue = "short.q", threads = 10)),
  tweak(multiprocess, workers = 10)
))
Note the workers option for multiprocess.
This is good practice: manually specify the number of threads to use when going multiprocess. Otherwise, R will use all available cores and other people will hate you 😡
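One way to do that (a sketch, not from the original deck) is to query future for the number of available cores and leave one free:

library(future)
n_workers <- max(1, availableCores() - 1) ## leave one core for the others
plan(multisession, workers = n_workers)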