
Overhead when exporting to PSOCK cluster #98

@odelmarcelle

Description


Describe the bug
I believe that a significant (and avoidable) overhead is present when using future_lapply() inside another function. I think this is related to the export of ...future.FUN, which is serialized together with its enclosing environment.

I experience this issue on Windows and have not had the chance to try it on macOS. I suspect it is only relevant to Windows.

Expected behavior
The overhead of deploying a function on multiple cores should remain limited.

Reproducible example
Consider the following (lengthy) example, where I apply identity() to a large list of character strings.

# Build a named list of 10,000 random strings of 5,000 characters each (~50 MB in total)
long_characters <- setNames(replicate(
  10000,
  paste0(c(letters, rep(" ", 10))[
    sample.int(length(letters) + 1, 5000, replace = TRUE)],
    collapse = ""),
  simplify = FALSE
), paste0("element", 1:10000))
object.size(long_characters) |> format("Mb")
## [1] "49.5 Mb"

# A plain wrapper around identity(), defined at the top level
some_function1 <- function(x) {
  identity(x)
}
# A wrapper that calls future_lapply() from inside another function, so the
# anonymous worker's enclosing environment is some_function2()'s frame
some_function2 <- function(x) {
  future_lapply(x, function(y) {
    identity(y)
  })
}

The second function nests future_lapply(), and this appears to have important implications for the parallel-processing overhead. I first quickly benchmark the functions with a sequential backend.

library(future)
library(future.apply)
library(foreach); library(doFuture); registerDoFuture()
plan(sequential)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 5
)
benchmark[, c(1:5, 7)]
# # A tibble: 4 x 6
# expression        min   median `itr/sec` mem_alloc n_itr
# <bch:expr>   <bch:tm> <bch:tm>    <dbl>  <bch:byt> <int>
# 1 1            22.1ms  23.08ms    41.7      4.29MB    21
# 2 2           25.89ms  26.69ms    36.0   1006.98KB    18
# 3 3            26.5ms  27.39ms    34.2   1006.48KB    18
# 4 4             1.21s    1.24s     0.809   53.67MB     5

The three future_lapply() variants perform more or less similarly, while foreach is slower, apparently due to some memory allocation. The results change drastically once a multisession plan is registered:

plan(multisession, workers = 2)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 5
)
benchmark[, c(1:5, 7)]
# # A tibble: 4 x 6
# expression        min   median `itr/sec` mem_alloc n_itr
# <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt> <int>
# 1 1           574.9ms 579.89ms     1.57     1.91MB     5
# 2 2          568.84ms 570.91ms     1.68     1.83MB     5
# 3 3          861.42ms    1.16s     0.910    1.84MB     5
# 4 4             1.37s    1.44s     0.702   53.66MB     5

Here the nested call to future_lapply() inside some_function2() appears to suffer compared to the first two alternatives. The foreach option is relatively unaffected by the change of plan.

The gap between the nested and non-nested future_lapply() increases as the number of cores rises:

plan(multisession, workers = 10)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 5
)
benchmark[, c(1:5, 7)]
# # A tibble: 4 x 6
# expression        min   median `itr/sec` mem_alloc n_itr
# <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt> <int>
# 1 1          880.32ms     1.5s     0.724    4.17MB     5
# 2 2          580.24ms    1.21s     0.977    4.09MB     5
# 3 3             4.45s    4.75s     0.208    4.17MB     5
# 4 4              1.4s    1.42s     0.568   55.48MB     5

Now the nested future_lapply() is even slower than foreach! In contrast with the other options, foreach is barely affected by the increase in the number of workers.

To investigate the matter further, I implemented a very dirty replacement of the internal function sendData.SOCK0node() of the parallel package, adding a few lines that log to a file the time elapsed for each export to a cluster node.
I know that future already provides many monitoring tools when options(future.debug = TRUE) is set, but something felt off when I looked at the log of exported values. I therefore measured the exported objects with pryr::object_size(), which is supposed to be more reliable than utils::object.size(). The results suggest that the enclosing environment of the function passed to future_lapply() is unnecessarily exported to each node.

Here is the dirty replacement for the parallel function.

# Logging wrapper around parallel:::sendData.SOCK0node(): 'tracker' is a global
# call counter (initialised below) and each export is appended to "log.txt"
sendData.SOCK0node <- function(node, data) {
  start <- Sys.time()
  on.exit({
    elapsed <- Sys.time() - start
    res <- data.frame(
      Callnr = tracker,
      Elapsed = elapsed,
      Size = format(pryr::object_size(data)),
      Object = NA
    )
    if (length(data$data$args) > 0) res$Object <- as.character(data$data$args[[1]][1])
    write.table(
      res, file = "log.txt", append = TRUE, col.names = FALSE, sep = ",",
      row.names = FALSE
    )
    tracker <<- tracker + 1
  })
  serialize(data, node$con, xdr = FALSE)
}
unlockBinding("sendData.SOCK0node", env = getNamespace("parallel"))
assignInNamespace("sendData.SOCK0node", sendData.SOCK0node, "parallel")
# Error in get(S3[i, 1L], mode = "function", envir = parent.frame()) : 
#   object 'sendData' of mode 'function' was not found
parallel:::sendData.SOCK0node
# function(node, data) {
#   start <- Sys.time()
#   on.exit({
#     elapsed <- Sys.time() - start
#     res <- data.frame(
#       Callnr = tracker,
#       Elapsed = elapsed,
#       Size = format(pryr::object_size(data)),
#       Object = NA
#     )
#     if (length(data$data$args) > 0) res$Object <- as.character(data$data$args[[1]][1])
#     write.table(
#       res, file = "log.txt", append = TRUE, col.names = FALSE, sep = ",",
#       row.names = FALSE
#     )
#     tracker <<- tracker + 1
#   })
#   serialize(data, node$con, xdr = FALSE)
# }

Although this throws an error, it was enough (for me at least) to replace the existing function in the parallel package. I know this seems barbaric, but my limited knowledge of future's internals led me to it.

This way, I was able to observe the data transferred to each node during the nested future_lapply() call, monitoring its size with pryr::object_size().

tracker <- 1
cat(r"("Callnr", "Elapsed", "Size", "Object")", "\n", file = "log.txt", sep = "")
plan(multisession, workers = 2)
some_function2(long_characters)
res <- read.csv("log.txt")
res
#      Callnr      Elapsed     Size                    Object
#   1       1 0.0004999638      552                      <NA>
#   2       2 0.0005021095      552                      <NA>
#   3       3 0.0000000000      552                      <NA>
#   4       4 0.0000000000      552                      <NA>
#   5       5 0.0000000000      552                      <NA>
#   6       6 0.0000000000      552                      <NA>
#   7       7 0.0000000000      552                      <NA>
#   8       8 0.0000000000      552                      <NA>
#   9       9 0.0000000000      552                      <NA>
#   10     10 0.0000000000      552                      <NA>
#   11     11 0.0000000000     3936                Sys.getpid
#   12     12 0.0000000000     3936                Sys.getpid
#   13     13 0.0000000000     8328                     FALSE
#   14     14 0.0000000000     3936                Sys.getpid
#   15     15 0.0000000000     8328                     FALSE
#   16     16 0.0000000000     3936                Sys.getpid
#   17     17 0.0000000000     5512                      <NA>
#   18     18 0.0000000000   115456                         {
#   19     19 0.0000000000     5512                      <NA>
#   20     20 0.1896619797 51929376             ...future.FUN
#   21     21 0.0000000000     5232     future.call.arguments
#   22     22 0.0695588589 25965096     ...future.elements_ii
#   23     23 0.0000000000     4944        ...future.seeds_ii
#   24     24 0.0000000000     4944 ...future.globals.maxSize
#   25     25 0.0000000000   119960                         {
#   26     26 0.0004999638     5512                      <NA>
#   27     27 0.1976690292 51929376             ...future.FUN
#   28     28 0.0000000000     5232     future.call.arguments
#   29     29 0.0705609322 25965096     ...future.elements_ii
#   30     30 0.0000000000     4944        ...future.seeds_ii
#   31     31 0.0004999638     4944 ...future.globals.maxSize
#   32     32 0.0005011559   119960                         {

The interesting parts here are rows 20 and 27, corresponding to the export of future_lapply()'s function to each of the two nodes. According to pryr::object_size(), a large amount of data is sent to the worker at that point, and the elapsed time of that export is correspondingly significant.

In contrast, the same analysis on the non-nested call to future_lapply() does not show these large exports.

tracker <- 1
cat(r"("Callnr", "Elapsed", "Size", "Object")", "\n", file = "log.txt", sep = "")
plan(multisession, workers = 2)
future_lapply(long_characters, some_function1)
res <- read.csv("log.txt")
res
#      Callnr      Elapsed     Size                    Object
#   1       1 0.0004999638      552                      <NA>
#   2       2 0.0000000000      552                      <NA>
#   3       3 0.0000000000     3936                Sys.getpid
#   4       4 0.0000000000     3936                Sys.getpid
#   5       5 0.0000000000     8328                     FALSE
#   6       6 0.0000000000     3936                Sys.getpid
#   7       7 0.0000000000     8328                     FALSE
#   8       8 0.0000000000     3936                Sys.getpid
#   9       9 0.0000000000     5512                      <NA>
#   10     10 0.0004999638   115456                         {
#   11     11 0.0004999638     5512                      <NA>
#   12     12 0.0000000000     7680             ...future.FUN
#   13     13 0.0000000000     5232     future.call.arguments
#   14     14 0.6837708950 25965096     ...future.elements_ii
#   15     15 0.0000000000     4944        ...future.seeds_ii
#   16     16 0.0000000000     4944 ...future.globals.maxSize
#   17     17 0.0004999638   119960                         {
#   18     18 0.0000000000     5512                      <NA>
#   19     19 0.0000000000     7680             ...future.FUN
#   20     20 0.0000000000     5232     future.call.arguments
#   21     21 0.0700600147 25965096     ...future.elements_ii
#   22     22 0.0000000000     4944        ...future.seeds_ii
#   23     23 0.0005002022     4944 ...future.globals.maxSize
#   24     24 0.0005009174   119960                         {

For the non-nested case, the exported function is much smaller.

After investigating the ...future.FUN object in more detail, the issue seems to be related to the function's environment. When the function is serialized, its enclosing environment is serialized as well, which effectively forces the entire input of future_lapply() to be exported to each node.
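
For what it's worth, the same mechanism can be reproduced without future at all. Below is a minimal sketch (make_closure() and big are purely illustrative names) showing that serializing a closure also serializes the objects captured in its enclosing environment, while the same function re-pointed at the global environment serializes to only a few hundred bytes.

# Minimal sketch, independent of future: a closure drags its enclosing frame along when serialized
make_closure <- function() {
  big <- replicate(10, paste(sample(letters, 1e5, replace = TRUE), collapse = ""))  # ~1 MB of data
  function(y) identity(y)  # enclosing environment is this function's frame, which contains 'big'
}
f <- make_closure()
length(serialize(f, NULL))    # large (on the order of 1 MB): the payload includes 'big'
environment(f) <- globalenv()
length(serialize(f, NULL))    # a few hundred bytes: the global environment is serialized by reference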

To prove this point, I attempted to replace the environment of the exported function with the global environment:

sendData.SOCK0node <- function(node, data) {
  start <- Sys.time()
  # Replace the enclosing environment of the exported function (if any) with the global environment
  if (length(data$data$args) > 1) {
    if (is.function(data$data$args[[2]])) environment(data$data$args[[2]]) <- .GlobalEnv
  }
  on.exit({
    elapsed <- Sys.time() - start
    res <- data.frame(
      Callnr = tracker,
      Elapsed = elapsed,
      Size = format(pryr::object_size(data)),
      Object = NA
    )
    if (length(data$data$args) > 0) res$Object <- as.character(data$data$args[[1]][1])
    write.table(
      res, file = "log.txt", append = TRUE, col.names = FALSE, sep = ",",
      row.names = FALSE
    )
    tracker <<- tracker + 1
  })
  serialize(data, node$con, xdr = FALSE)
}
unlockBinding("sendData.SOCK0node", env = getNamespace("parallel"))
assignInNamespace("sendData.SOCK0node", sendData.SOCK0node, "parallel")
tracker <- 1
cat(r"("Callnr", "Elapsed", "Size", "Object")", "\n", file = "log.txt", sep = "")
plan(multisession, workers = 2)
system.time(some_function2(long_characters))
res <- read.csv("log.txt")
res
#      Callnr      Elapsed     Size                    Object
#   1       1 0.0000000000      552                      <NA>
#   2       2 0.0004999638      552                      <NA>
#   3       3 0.0000000000     3936                Sys.getpid
#   4       4 0.0000000000     3936                Sys.getpid
#   5       5 0.0000000000     8328                     FALSE
#   6       6 0.0000000000     3936                Sys.getpid
#   7       7 0.0000000000     8328                     FALSE
#   8       8 0.0000000000     3936                Sys.getpid
#   9       9 0.0000000000     5512                      <NA>
#   10     10 0.0000000000   115456                         {
#   11     11 0.0000000000     5512                      <NA>
#   12     12 0.0000000000     7744             ...future.FUN
#   13     13 0.0000000000     5232     future.call.arguments
#   14     14 0.0690588951 25965096     ...future.elements_ii
#   15     15 0.0000000000     4944        ...future.seeds_ii
#   16     16 0.0000000000     4944 ...future.globals.maxSize
#   17     17 0.0000000000   119960                         {
#   18     18 0.0000000000     5512                      <NA>
#   19     19 0.0005009174     7744             ...future.FUN
#   20     20 0.0000000000     5232     future.call.arguments
#   21     21 0.3738429546 25965096     ...future.elements_ii
#   22     22 0.0004999638     4944        ...future.seeds_ii
#   23     23 0.0000000000     4944 ...future.globals.maxSize
#   24     24 0.0000000000   119960                         {

Despite the nested call to future_lapply(), the enclosing environment of the function now no longer seems to be exported. I'm not sure of the implications for other functions (since this example only involves identity()), but the improvement brought by this small change suggests that changes in that direction could significantly speed up the use of future_lapply() in other packages.

Finally, I re-benchmarked the functions with 10 workers and this adjusted ...future.FUN environment:

# Same environment replacement as above, but without the logging
sendData.SOCK0node <- function(node, data) {
  start <- Sys.time()
  if (length(data$data$args) > 1) {
    if (is.function(data$data$args[[2]])) environment(data$data$args[[2]]) <- .GlobalEnv
  }
  serialize(data, node$con, xdr = FALSE)
}
unlockBinding("sendData.SOCK0node", env = getNamespace("parallel"))
assignInNamespace("sendData.SOCK0node", sendData.SOCK0node, "parallel")
plan(multisession, workers = 10)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 5
)
benchmark[, c(1:5, 7)]
# # A tibble: 4 x 6
# expression        min   median `itr/sec` mem_alloc n_itr
# <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt> <int>
# 1 1             1.22s    1.53s     0.606    4.08MB     5
# 2 2             1.21s    1.51s     0.659    4.09MB     5
# 3 3          898.29ms    1.52s     0.659    4.09MB     5
# 4 4             1.44s    1.46s     0.689   55.49MB     5

The nested loop is drastically improved and is now comparable with the first two options! Of course, naively setting the function's environment to the global environment is surely not the best way to handle this issue. But since a good use case of future_lapply() is to essentially replace lapply() inside packages, leaving the user free to set the parallelism with plan(), it is a shame that using future_lapply() inside a function suffers from this large overhead.
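
As a user-side stop-gap, here is a sketch of a possible workaround (some_function2b() is hypothetical, not something the package suggests): detach the worker function from the calling frame before handing it to future_lapply(). This is only safe when the worker does not rely on objects from that frame.

# Hypothetical variant of some_function2(): strip the worker's enclosing
# environment so the large input 'x' is not dragged along with ...future.FUN
some_function2b <- function(x) {
  worker <- function(y) identity(y)
  environment(worker) <- globalenv()  # safe here because 'worker' uses nothing from this frame
  future_lapply(x, worker)
}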

My session info:

sessionInfo()
# R version 4.1.2 (2021-11-01)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19043)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=French_Belgium.1252  LC_CTYPE=French_Belgium.1252    LC_MONETARY=French_Belgium.1252 LC_NUMERIC=C                   
# [5] LC_TIME=French_Belgium.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] doFuture_0.12.0    foreach_1.5.2      future.apply_1.8.1 future_1.24.0     
# 
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.8        rstudioapi_0.13   parallelly_1.30.0 lobstr_1.1.1      knitr_1.37        magrittr_2.0.2    rlang_1.0.1      
# [8] fastmap_1.1.0     fansi_1.0.2       stringr_1.4.0     globals_0.14.0    tools_4.1.2       parallel_4.1.2    xfun_0.29        
# [15] sessioninfo_1.2.2 utf8_1.2.2        cli_3.2.0         htmltools_0.5.2   iterators_1.0.14  ellipsis_0.3.2    yaml_2.3.5       
# [22] digest_0.6.29     tibble_3.1.6      lifecycle_1.0.1   crayon_1.5.0      pryr_0.1.5        vctrs_0.3.8       codetools_0.2-18 
# [29] glue_1.6.1        evaluate_0.15     rmarkdown_2.11    bench_1.1.2       stringi_1.7.6     compiler_4.1.2    pillar_1.7.0     
# [36] profmem_0.6.0     listenv_0.8.0     pkgconfig_2.0.3  
