Here is an example of a call to multidplyr in code that I'm running on my institute's cluster:
#create data
set.seed(1)
library(dplyr)
df <- do.call(rbind, lapply(1:100, function(i) {
  id.df <- data.frame(
    id = paste0("ID", i),
    value = runif(1000, 0, 5),
    group = c(rep("group1", 500), rep("group2", 500))
  )
  id.df$value[sample(1000, 250, replace = FALSE)] <- 0
  return(id.df)
})) %>% dplyr::group_by(id)
#simple function
tTest <- function(df) {
  df$p.value <- t.test(
    dplyr::filter(df, group == "group1")$value,
    dplyr::filter(df, group == "group2")$value
  )$p.value
  return(df)
}
#call to multidplyr
df %>%
  multidplyr::partition(id) %>%
  multidplyr::cluster_library("tidyverse") %>%
  multidplyr::cluster_library("MASS") %>%
  multidplyr::cluster_assign_value("tTest", tTest) %>%
  do(results = tTest(.)) %>%
  dplyr::collect() %>%
  .$results %>%
  dplyr::bind_rows()
I get this error message:
Initialising 15 core cluster.
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0.9000 ✔ readr 1.1.1
✔ tibble 1.4.2 ✔ purrr 0.2.5
✔ tidyr 0.8.1 ✔ stringr 1.3.1
✔ ggplot2 3.0.0.9000 ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
15 nodes produced errors; first error: Can't convert an environment to function
Call `rlang::last_error()` to see a backtrace
In addition: Warning message:
group_indices_.grouped_df ignores extra arguments
and:
> rlang::last_error()
NULL
So I really don't know where to start debugging this.
Does anyone have an idea?
Here is the sessionInfo, in case it's useful:
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /sw/R/R-3.4.3-install/lib64/R/lib/libRblas.so
LAPACK: /sw/R/R-3.4.3-install/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MASS_7.3-51 forcats_0.3.0 stringr_1.3.1 purrr_0.2.5
[5] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0.9000
[9] tidyverse_1.2.1 dplyr_0.7.7
loaded via a namespace (and not attached):
[1] Rcpp_0.12.19 cellranger_1.1.0 pillar_1.1.0
[4] compiler_3.4.3 plyr_1.8.4 bindr_0.1.1
[7] tools_3.4.3 jsonlite_1.5 lubridate_1.7.4
[10] gtable_0.2.0 nlme_3.1-137 lattice_0.20-34
[13] pkgconfig_2.0.2 rlang_0.2.2.9001 cli_1.0.1
[16] rstudioapi_0.7 parallel_3.4.3 haven_1.1.2
[19] bindrcpp_0.2.2 withr_2.1.2 xml2_1.2.0
[22] httr_1.3.1 hms_0.4.2 grid_3.4.3
[25] tidyselect_0.2.5 glue_1.3.0 R6_2.3.0
[28] readxl_1.1.0 multidplyr_0.0.0.9000 modelr_0.1.2
[31] magrittr_1.5 backports_1.1.2 scales_1.0.0.9000
[34] rvest_0.3.2 assertthat_0.2.0 colorspace_1.3-2
[37] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0
[40] broom_0.5.0 crayon_1.3.4
When I try the same thing on my desktop Mac, it runs fine.
There I benchmarked this multidplyr call against a simple call and got:
Unit: milliseconds
expr
df %>% multidplyr::partition(id) %>% multidplyr::cluster_library("tidyverse") %>% multidplyr::cluster_library("MASS") %>% multidplyr::cluster_assign_value("tTest", tTest) %>% do(results = tTest(.)) %>% dplyr::collect() %>% .$results %>% dplyr::bind_rows()
min lq mean median uq max neval
190.7746 195.7953 206.1758 203.7096 213.0648 296.4428 100
Unit: milliseconds
expr min lq mean median uq max neval
df %>% dplyr::group_by(id) %>% tTest() 16.28017 18.2625 19.45435 18.6393 19.06788 30.07284 100
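A caveat about this comparison: as far as I can tell, the two expressions do not do the same amount of work. In the multidplyr call, do() runs tTest() once per id, while tTest() called directly on the grouped data frame filters across all ids and runs a single t-test. A like-for-like sequential baseline would test each id separately. Below is a minimal sketch of such a baseline in base R only; the smaller toy data and the helper name t_test_one are assumptions made for illustration, not part of the original code.

```r
# Toy data with the same columns as in the question (smaller, for speed)
set.seed(1)
toy <- do.call(rbind, lapply(1:5, function(i) {
  data.frame(
    id = paste0("ID", i),
    value = runif(100, 0, 5),
    group = rep(c("group1", "group2"), each = 50),
    stringsAsFactors = FALSE
  )
}))

# Hypothetical helper: one t-test per data frame, base R only
t_test_one <- function(d) {
  d$p.value <- t.test(
    d$value[d$group == "group1"],
    d$value[d$group == "group2"]
  )$p.value
  d
}

# Sequential per-id baseline: split by id, test each piece, recombine
per_id <- do.call(rbind, lapply(split(toy, toy$id), t_test_one))
```

Timing this kind of per-id loop against the multidplyr call would compare the same number of t-tests on both sides.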
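Is there some constant overhead that multidplyr adds for setting up the cores, which microbenchmark is measuring, and which makes it look unfavorable in this toy example but would definitely be worth it on larger datasets?
Otherwise, am I missing the point of using multidplyr?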
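One rough way to see how much fixed cost is involved is to time the cluster start-up and a near-empty round trip on their own. The sketch below uses the base parallel package rather than multidplyr itself, on the assumption that this multidplyr version builds on parallel (the checkForRemoteErrors call in the error above comes from that package). The worker count of 2 is arbitrary.

```r
library(parallel)

# Fixed cost 1: starting the worker processes (2 workers is an arbitrary choice)
setup_time <- system.time(cl <- makeCluster(2))[["elapsed"]]

# Fixed cost 2: a round trip that does almost no work on each worker
roundtrip_time <- system.time(
  res <- parLapply(cl, 1:2, function(i) i * 2)
)[["elapsed"]]

stopCluster(cl)

cat("cluster start-up (s):", setup_time, "\n")
cat("trivial round trip (s):", roundtrip_time, "\n")
```

If these numbers are of the same order as the difference seen in the benchmark, the gap is mostly set-up and communication cost rather than the t-tests themselves.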
Thanks a lot.