
Overhead when exporting to PSOCK cluster #98

@odelmarcelle

Description


Describe the bug
I believe that a significant (and avoidable) overhead is present when using future_lapply() inside another function. I think this is related to the export of ...future.FUN, which is serialized together with its enclosing environment.

I experience this issue on Windows and have not had the chance to try it on macOS. I suspect it is only relevant to Windows.

Expected behavior
The overhead of deploying a function on multiple cores should remain limited.

Reproducible example
Consider the following (lengthy) example, where I apply identity() to a large list of character strings.

# Build a named list of 10,000 random strings of 5,000 characters each (~50 MB in total)
long_characters <- setNames(replicate(
  10000,
  paste0(c(letters, rep(" ", 10))[
    sample.int(length(letters) + 1, 5000, replace = TRUE)],
    collapse = ""),
  simplify = FALSE
), paste0("element", 1:10000))
object.size(long_characters) |> format("Mb")
## [1] "49.5 Mb"

# A plain wrapper around identity(), defined at the top level
some_function1 <- function(x) {
  identity(x)
}
# A wrapper that calls future_lapply() from inside another function, so the
# anonymous worker's enclosing environment is some_function2()'s frame
some_function2 <- function(x) {
  future_lapply(x, function(y) {
    identity(y)
  })
}

The second function nests future_lapply(), and this appears to have important implications for the parallel-processing overhead. I first quickly benchmark the functions with a sequential backend.

library(future)
library(future.apply)
library(foreach); library(doFuture); registerDoFuture()
plan(sequential)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 5
)
benchmark[, c(1:5, 7)]
# # A tibble: 4 x 6
# expression        min   median `itr/sec` mem_alloc n_itr
# <bch:expr>   <bch:tm> <bch:tm>    <dbl>  <bch:byt> <int>
# 1 1            22.1ms  23.08ms    41.7      4.29MB    21
# 2 2           25.89ms  26.69ms    36.0   1006.98KB    18
# 3 3            26.5ms  27.39ms    34.2   1006.48KB    18
# 4 4             1.21s    1.24s     0.809   53.67MB     5

The three future_lapply() variants perform more or less similarly, while foreach is slower, apparently due to some memory allocation. The results change drastically once a multisession plan is registered:

plan(multisession, workers = 2)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 5
)
benchmark[, c(1:5, 7)]
# # A tibble: 4 x 6
# expression        min   median `itr/sec` mem_alloc n_itr
# <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt> <int>
# 1 1           574.9ms 579.89ms     1.57     1.91MB     5
# 2 2          568.84ms 570.91ms     1.68     1.83MB     5
# 3 3          861.42ms    1.16s     0.910    1.84MB     5
# 4 4             1.37s    1.44s     0.702   53.66MB     5

Here the nested call to future_lapply() inside some_function2() appears to suffer compared to the first two alternatives. The foreach option is relatively unaffected by the change of plan.

The gap between the nested and non-nested future_lapply() increases as the number of cores rises:

plan(multisession, workers = 10)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 5
)
benchmark[, c(1:5, 7)]
# # A tibble: 4 x 6
# expression        min   median `itr/sec` mem_alloc n_itr
# <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt> <int>
# 1 1          880.32ms     1.5s     0.724    4.17MB     5
# 2 2          580.24ms    1.21s     0.977    4.09MB     5
# 3 3             4.45s    4.75s     0.208    4.17MB     5
# 4 4              1.4s    1.42s     0.568   55.48MB     5

Now the nested future_lapply() is even slower than foreach! In contrast with the other options, foreach is barely affected by the increase in the number of workers.

To investigate the matter further, I implemented a very dirty replacement of the internal function sendData.SOCK0node() of the parallel package, adding a few lines that log to a file the time elapsed for each export to a cluster node.
I know that future already provides many monitoring tools when options(future.debug = TRUE) is set, but something felt off when I looked at the log of exported values. I therefore measured the exported objects with pryr::object_size(), which is supposed to be more reliable than utils::object.size(). The results suggest that the enclosing environment of the function passed to future_lapply() is unnecessarily exported to each node.

Here is the dirty replacement for the parallel function.

# Logging wrapper around parallel:::sendData.SOCK0node(): 'tracker' is a global
# call counter (initialised below) and each export is appended to "log.txt"
sendData.SOCK0node <- function(node, data) {
  start <- Sys.time()
  on.exit({
    elapsed <- Sys.time() - start
    res <- data.frame(
      Callnr = tracker,
      Elapsed = elapsed,
      Size = format(pryr::object_size(data)),
      Object = NA
    )
    if (length(data$data$args) > 0) res$Object <- as.character(data$data$args[[1]][1])
    write.table(
      res, file = "log.txt", append = TRUE, col.names = FALSE, sep = ",",
      row.names = FALSE
    )
    tracker <<- tracker + 1
  })
  serialize(data, node$con, xdr = FALSE)
}
unlockBinding("sendData.SOCK0node", env = getNamespace("parallel"))
assignInNamespace("sendData.SOCK0node", sendData.SOCK0node, "parallel")
# Error in get(S3[i, 1L], mode = "function", envir = parent.frame()) : 
#   object 'sendData' of mode 'function' was not found
parallel:::sendData.SOCK0node
# function(node, data) {
#   start <- Sys.time()
#   on.exit({
#     elapsed <- Sys.time() - start
#     res <- data.frame(
#       Callnr = tracker,
#       Elapsed = elapsed,
#       Size = format(pryr::object_size(data)),
#       Object = NA
#     )
#     if (length(data$data$args) > 0) res$Object <- as.character(data$data$args[[1]][1])
#     write.table(
#       res, file = "log.txt", append = TRUE, col.names = FALSE, sep = ",",
#       row.names = FALSE
#     )
#     tracker <<- tracker + 1
#   })
#   serialize(data, node$con, xdr = FALSE)
# }

Although this throws an error, it was enough (for me at least) to replace the existing function in the parallel package. I know this seems barbaric, but my limited knowledge of future's internals led me to it.

This way, I was able to observe the data transferred to each node during the nested future_lapply() call, monitoring its size with pryr::object_size().

tracker <- 1
cat(r"("Callnr", "Elapsed", "Size", "Object")", "\n", file = "log.txt", sep = "")
plan(multisession, workers = 2)
some_function2(long_characters)
res <- read.csv("log.txt")
res
#      Callnr      Elapsed     Size                    Object
#   1       1 0.0004999638      552                      <NA>
#   2       2 0.0005021095      552                      <NA>
#   3       3 0.0000000000      552                      <NA>
#   4       4 0.0000000000      552                      <NA>
#   5       5 0.0000000000      552                      <NA>
#   6       6 0.0000000000      552                      <NA>
#   7       7 0.0000000000      552                      <NA>
#   8       8 0.0000000000      552                      <NA>
#   9       9 0.0000000000      552                      <NA>
#   10     10 0.0000000000      552                      <NA>
#   11     11 0.0000000000     3936                Sys.getpid
#   12     12 0.0000000000     3936                Sys.getpid
#   13     13 0.0000000000     8328                     FALSE
#   14     14 0.0000000000     3936                Sys.getpid
#   15     15 0.0000000000     8328                     FALSE
#   16     16 0.0000000000     3936                Sys.getpid
#   17     17 0.0000000000     5512                      <NA>
#   18     18 0.0000000000   115456                         {
#   19     19 0.0000000000     5512                      <NA>
#   20     20 0.1896619797 51929376             ...future.FUN
#   21     21 0.0000000000     5232     future.call.arguments
#   22     22 0.0695588589 25965096     ...future.elements_ii
#   23     23 0.0000000000     4944        ...future.seeds_ii
#   24     24 0.0000000000     4944 ...future.globals.maxSize
#   25     25 0.0000000000   119960                         {
#   26     26 0.0004999638     5512                      <NA>
#   27     27 0.1976690292 51929376             ...future.FUN
#   28     28 0.0000000000     5232     future.call.arguments
#   29     29 0.0705609322 25965096     ...future.elements_ii
#   30     30 0.0000000000     4944        ...future.seeds_ii
#   31     31 0.0004999638     4944 ...future.globals.maxSize
#   32     32 0.0005011559   119960                         {

The interesting parts here are rows 20 and 27, corresponding to the export of future_lapply()'s function to each of the two nodes. According to pryr::object_size(), a large amount of data is sent to the worker at that point, and the elapsed time of that export is correspondingly significant.

In contrast, the same analysis on the non-nested call to future_lapply() does not show these large exports.

tracker <- 1
cat(r"("Callnr", "Elapsed", "Size", "Object")", "\n", file = "log.txt", sep = "")
plan(multisession, workers = 2)
future_lapply(long_characters, some_function1)
res <- read.csv("log.txt")
res
#      Callnr      Elapsed     Size                    Object
#   1       1 0.0004999638      552                      <NA>
#   2       2 0.0000000000      552                      <NA>
#   3       3 0.0000000000     3936                Sys.getpid
#   4       4 0.0000000000     3936                Sys.getpid
#   5       5 0.0000000000     8328                     FALSE
#   6       6 0.0000000000     3936                Sys.getpid
#   7       7 0.0000000000     8328                     FALSE
#   8       8 0.0000000000     3936                Sys.getpid
#   9       9 0.0000000000     5512                      <NA>
#   10     10 0.0004999638   115456                         {
#   11     11 0.0004999638     5512                      <NA>
#   12     12 0.0000000000     7680             ...future.FUN
#   13     13 0.0000000000     5232     future.call.arguments
#   14     14 0.6837708950 25965096     ...future.elements_ii
#   15     15 0.0000000000     4944        ...future.seeds_ii
#   16     16 0.0000000000     4944 ...future.globals.maxSize
#   17     17 0.0004999638   119960                         {
#   18     18 0.0000000000     5512                      <NA>
#   19     19 0.0000000000     7680             ...future.FUN
#   20     20 0.0000000000     5232     future.call.arguments
#   21     21 0.0700600147 25965096     ...future.elements_ii
#   22     22 0.0000000000     4944        ...future.seeds_ii
#   23     23 0.0005002022     4944 ...future.globals.maxSize
#   24     24 0.0005009174   119960                         {

For the non-nested case, the exported function is much smaller.

After investigating the ...future.FUN object in more detail, the issue seems to be related to the function's environment. When the function is serialized, its enclosing environment is serialized as well, which effectively forces the entire input of future_lapply() to be exported to each node.
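
For what it's worth, the same mechanism can be reproduced without future at all. Below is a minimal sketch (make_closure() and big are purely illustrative names) showing that serializing a closure also serializes the objects captured in its enclosing environment, while the same function re-pointed at the global environment serializes to only a few hundred bytes.

# Minimal sketch, independent of future: a closure drags its enclosing frame along when serialized
make_closure <- function() {
  big <- replicate(10, paste(sample(letters, 1e5, replace = TRUE), collapse = ""))  # ~1 MB of data
  function(y) identity(y)  # enclosing environment is this function's frame, which contains 'big'
}
f <- make_closure()
length(serialize(f, NULL))    # large (on the order of 1 MB): the payload includes 'big'
environment(f) <- globalenv()
length(serialize(f, NULL))    # a few hundred bytes: the global environment is serialized by reference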

To prove this point, I attempted to replace the environment of the exported function with the global environment:

sendData.SOCK0node <- function(node, data) {
  start <- Sys.time()
  # Replace the enclosing environment of the exported function (if any) with the global environment
  if (length(data$data$args) > 1) {
    if (is.function(data$data$args[[2]])) environment(data$data$args[[2]]) <- .GlobalEnv
  }
  on.exit({
    elapsed <- Sys.time() - start
    res <- data.frame(
      Callnr = tracker,
      Elapsed = elapsed,
      Size = format(pryr::object_size(data)),
      Object = NA
    )
    if (length(data$data$args) > 0) res$Object <- as.character(data$data$args[[1]][1])
    write.table(
      res, file = "log.txt", append = TRUE, col.names = FALSE, sep = ",",
      row.names = FALSE
    )
    tracker <<- tracker + 1
  })
  serialize(data, node$con, xdr = FALSE)
}
unlockBinding("sendData.SOCK0node", env = getNamespace("parallel"))
assignInNamespace("sendData.SOCK0node", sendData.SOCK0node, "parallel")
tracker <- 1
cat(r"("Callnr", "Elapsed", "Size", "Object")", "\n", file = "log.txt", sep = "")
plan(multisession, workers = 2)
system.time(some_function2(long_characters))
res <- read.csv("log.txt")
res
#      Callnr      Elapsed     Size                    Object
#   1       1 0.0000000000      552                      <NA>
#   2       2 0.0004999638      552                      <NA>
#   3       3 0.0000000000     3936                Sys.getpid
#   4       4 0.0000000000     3936                Sys.getpid
#   5       5 0.0000000000     8328                     FALSE
#   6       6 0.0000000000     3936                Sys.getpid
#   7       7 0.0000000000     8328                     FALSE
#   8       8 0.0000000000     3936                Sys.getpid
#   9       9 0.0000000000     5512                      <NA>
#   10     10 0.0000000000   115456                         {
#   11     11 0.0000000000     5512                      <NA>
#   12     12 0.0000000000     7744             ...future.FUN
#   13     13 0.0000000000     5232     future.call.arguments
#   14     14 0.0690588951 25965096     ...future.elements_ii
#   15     15 0.0000000000     4944        ...future.seeds_ii
#   16     16 0.0000000000     4944 ...future.globals.maxSize
#   17     17 0.0000000000   119960                         {
#   18     18 0.0000000000     5512                      <NA>
#   19     19 0.0005009174     7744             ...future.FUN
#   20     20 0.0000000000     5232     future.call.arguments
#   21     21 0.3738429546 25965096     ...future.elements_ii
#   22     22 0.0004999638     4944        ...future.seeds_ii
#   23     23 0.0000000000     4944 ...future.globals.maxSize
#   24     24 0.0000000000   119960                         {

Despite the nested call to future_lapply(), the enclosing environment of the function now no longer seems to be exported. I'm not sure of the implications for other functions (since this example only involves identity()), but the improvement brought by this small change suggests that changes in that direction could significantly speed up the use of future_lapply() in other packages.

Finally, I re-benchmarked the functions with 10 workers and this adjusted ...future.FUN environment:

# Same environment replacement as above, but without the logging
sendData.SOCK0node <- function(node, data) {
  start <- Sys.time()
  if (length(data$data$args) > 1) {
    if (is.function(data$data$args[[2]])) environment(data$data$args[[2]]) <- .GlobalEnv
  }
  serialize(data, node$con, xdr = FALSE)
}
unlockBinding("sendData.SOCK0node", env = getNamespace("parallel"))
assignInNamespace("sendData.SOCK0node", sendData.SOCK0node, "parallel")
plan(multisession, workers = 10)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 5
)
benchmark[, c(1:5, 7)]
# # A tibble: 4 x 6
# expression        min   median `itr/sec` mem_alloc n_itr
# <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt> <int>
# 1 1             1.22s    1.53s     0.606    4.08MB     5
# 2 2             1.21s    1.51s     0.659    4.09MB     5
# 3 3          898.29ms    1.52s     0.659    4.09MB     5
# 4 4             1.44s    1.46s     0.689   55.49MB     5

The nested loop is drastically improved and is now comparable with the first two options! Of course, naively setting the function's environment to the global environment is surely not the best way to handle this issue. But since a good use case of future_lapply() is to essentially replace lapply() inside packages, leaving the user free to set the parallelism with plan(), it is a shame that using future_lapply() inside a function suffers from this large overhead.
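
As a user-side stop-gap, here is a sketch of a possible workaround (some_function2b() is hypothetical, not something the package suggests): detach the worker function from the calling frame before handing it to future_lapply(). This is only safe when the worker does not rely on objects from that frame.

# Hypothetical variant of some_function2(): strip the worker's enclosing
# environment so the large input 'x' is not dragged along with ...future.FUN
some_function2b <- function(x) {
  worker <- function(y) identity(y)
  environment(worker) <- globalenv()  # safe here because 'worker' uses nothing from this frame
  future_lapply(x, worker)
}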

My session info:

sessionInfo()
# R version 4.1.2 (2021-11-01)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19043)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=French_Belgium.1252  LC_CTYPE=French_Belgium.1252    LC_MONETARY=French_Belgium.1252 LC_NUMERIC=C                   
# [5] LC_TIME=French_Belgium.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] doFuture_0.12.0    foreach_1.5.2      future.apply_1.8.1 future_1.24.0     
# 
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.8        rstudioapi_0.13   parallelly_1.30.0 lobstr_1.1.1      knitr_1.37        magrittr_2.0.2    rlang_1.0.1      
# [8] fastmap_1.1.0     fansi_1.0.2       stringr_1.4.0     globals_0.14.0    tools_4.1.2       parallel_4.1.2    xfun_0.29        
# [15] sessioninfo_1.2.2 utf8_1.2.2        cli_3.2.0         htmltools_0.5.2   iterators_1.0.14  ellipsis_0.3.2    yaml_2.3.5       
# [22] digest_0.6.29     tibble_3.1.6      lifecycle_1.0.1   crayon_1.5.0      pryr_0.1.5        vctrs_0.3.8       codetools_0.2-18 
# [29] glue_1.6.1        evaluate_0.15     rmarkdown_2.11    bench_1.1.2       stringi_1.7.6     compiler_4.1.2    pillar_1.7.0     
# [36] profmem_0.6.0     listenv_0.8.0     pkgconfig_2.0.3  
