r - Load balancing for parallel processing

Question

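I am running a function that is similar to computing a standard deviation, but it takes much longer to run.

I intend to use the function to compute a cumulative value, i.e. a standard-deviation-type function over day 1 to day n.

Because the calculation takes a long time, I would like to run it on a cluster.

So I want to split the data so that every node of the cluster finishes at roughly the same time. For example, if my function were as follows, the single-machine approach would work like this: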

library(xts)
vec <- xts(rnorm(1000), Sys.Date() - (1:1000))
lapply(1:length(vec), function(x) {
    Sys.sleep(30)
    sd(as.numeric(vec[1:x]))
})

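(Note: Sys.sleep is added there to represent the extra time taken to process my custom function.)

However, suppose I want to split this across two machines instead of one. How would I split the vector 1:length(vec) so that I can give machine 1 the list c(1:y) and machine 2 the list c((y+1):length(vec)), with both machines finishing on time? That is, what value of y makes the two processes finish at roughly the same time? And what if we were to do it across 10 machines: how would one find the breaks in the original vector c(1:length(vec)) for that to work?

That is, I would have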

y <- 750 # This is just a guess as to potentially where it might be.
vec <- xts(rnorm(1000), Sys.Date() - (1:1000))
# on machine 1 I would have
lapply(1:y, function(x) {
    Sys.sleep(30)
    sd(as.numeric(vec[1:x]))
})

# and on machine 2 I would have
# (parentheses matter: y+1:length(vec) would parse as y + (1:length(vec)))
lapply((y+1):length(vec), function(x) {
    Sys.sleep(30)
    sd(as.numeric(vec[1:x]))
})

score 6 · Accepted Answer

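The parallel package is now part of base R and can help run R on moderately sized clusters, including on Amazon EC2. The function parLapplyLB distributes work from an input vector over the worker nodes of a cluster.

One thing to be aware of is that makePSOCKcluster is (currently, as of R 2.15.2) limited to 128 workers by the NCONNECTIONS constant in connections.c.

Here is a quick example of a session using the parallel package that you can try on your own machine: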

library(parallel)
help(package=parallel)

## create the cluster passing an IP address for
## the head node
## hostname -i works on Linux, but not on BSD
## descendants (like OS X)
# cl <- makePSOCKcluster(hosts, master=system("hostname -i", intern=TRUE))

## for testing, start a cluster on your local machine
cl <- makePSOCKcluster(rep("localhost", 3))

## do something once on each worker
ans <- clusterEvalQ(cl, { mean(rnorm(1000)) })

## push data to the workers
myBigData <- rnorm(10000)
moreData <- c("foo", "bar", "blabber")
clusterExport(cl, c('myBigData', 'moreData'))

## test a time consuming job
## (~30 seconds on a 4 core machine)
system.time(ans <- parLapplyLB(cl, 1:100, function(i) {
  ## summarize a bunch of random sample means
  summary(
    sapply(1:runif(1, 100, 2000),
           function(j) { mean(rnorm(10000)) }))
}))

## shut down worker processes
stopCluster(cl)

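The Bioconductor group has set up a really easy way to get started: Using a parallel cluster in the cloud

For more about using the parallel package on EC2, see: R in the Cloud; and for R on clusters in general, see: CRAN Task View: High-Performance and Parallel Computing with R.

Finally, another well-established option external to R is Starcluster.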

score 2 · Accepted Answer

Take a look at the snow package, specifically the clusterApplyLB function, which handles a load-balanced apply.
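As a rough illustration, the cumulative calculation from the question might be handed to clusterApplyLB along the following lines. This is only a sketch under stated assumptions: it uses the parallel package (which ships a clusterApplyLB with the same interface as the one in snow), a two-worker local cluster in place of real nodes, a plain numeric vector instead of the xts series, and sd as a stand-in for the slow custom function.

```r
library(parallel)  # provides clusterApplyLB with the same interface as snow

# Assumption: two local workers stand in for the machines of a real cluster.
cl <- makePSOCKcluster(2)

vec <- rnorm(200)  # plain numeric vector instead of the xts series

# Copy the data to every worker once, instead of sending it with each task.
clusterExport(cl, "vec", envir = environment())

# Task x computes the SD of the first x values. Tasks are handed out one at
# a time, so a worker that finishes early simply receives the next one.
# The range starts at 2 because the SD of a single value is NA.
res <- clusterApplyLB(cl, 2:length(vec), function(x) stats::sd(vec[1:x]))

stopCluster(cl)

cum_sd <- unlist(res)  # results come back in the order of the input
```

Because tasks are dispatched dynamically, no break point y has to be chosen up front; the price is one round of communication per task.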

This will handle the distribution of work across nodes/cores more intelligently than a simple even partition.
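To see why an even partition can be a poor fit, here is a minimal sketch of how static break points could be estimated if one did want to split the work by hand. The helper balanced_breaks is hypothetical, and the default cost model (task x costs an amount proportional to x, because it touches vec[1:x]) is an assumption that would need to be checked against real timings. If a fixed overhead such as the 30-second sleep dominates, the cost per task is roughly constant and an even split is already balanced.

```r
# Hypothetical helper: split tasks 1..n into m contiguous chunks of roughly
# equal total cost, given an assumed (not measured) per-task cost function.
balanced_breaks <- function(n, m, cost = function(x) x) {
  cum <- cumsum(cost(seq_len(n)))          # cumulative cost up to each task
  targets <- cum[n] * seq_len(m - 1) / m   # cost each chunk boundary should reach
  # Last task of each chunk: first index whose cumulative cost meets the target.
  vapply(targets, function(t) which(cum >= t)[1], integer(1))
}

# Cost proportional to x: the break lands near 1000 / sqrt(2), not at 500.
balanced_breaks(1000, 2)   # 707

# Constant cost per task: the even split is recovered.
balanced_breaks(1000, 2, cost = function(x) rep(1, length(x)))   # 500

# Nine break points for ten machines.
balanced_breaks(1000, 10)
```

Under the linear-cost assumption machine 1 would take tasks 1:707 and machine 2 tasks 708:1000. A load-balanced apply sidesteps the need for any such cost model.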

score 1 · Accepted Answer


Consider using Hadoop (aka MapReduce) via RHIPE.

answered 2012-12-11T04:44:15.000
