我想说有两种不同的可能解决您的问题的方法。
第一个是根据预期的每个作业计算时间对作业到节点映射的静态优化。在开始计算之前,您将为每个作业(即数据框的行)分配一个节点。下面给出了一个可能的实现代码。
第二种解决方案是动态的,您必须根据clusterApplyLB
. 您将开始与第一种方法相同,但一旦作业完成,您将不得不重新计算最佳作业到节点的映射。根据您的问题,由于发生的不断重新优化,这可能会增加大量开销。我认为只要您的预期计算时间没有偏差,就没有必要这样做。
这里是第一种解决方案的代码:
library(parallel)
#set seed for reproducible example
set.seed(1234)
#let's say you have 100 calculations (i.e., rows)
#each of them takes between 0 and 1 second computation time
expected_job_length=runif(100)
#this is your data
#real_job_length is unknown but we use it in the mock-up function below
df=data.frame(job_id=seq_along(expected_job_length),
expected_job_length=expected_job_length,
#real_job_length=expected_job_length + some noise
real_job_length=expected_job_length+
runif(length(expected_job_length),-0.05,0.05))
#we might have a negative real_job_length; fix that
df=within(df,real_job_length[real_job_length<0]<-
real_job_length[real_job_length<0]+0.05)
#detectCores() gives in my case 4
cluster_size=4
准备作业到节点的映射优化:
#x will give the node_id (between 1 and cluster_size) for each job
total_time=function(x,expected_job_length) {
#in the calculation below, x will be a vector of reals
#we have to translate it into integers in order to use it as index vector
x=as.integer(round(x))
#return max of sum of node-binned expected job lengths
max(sapply(split(expected_job_length,x),sum))
}
#now optimize the distribution of jobs amongst the nodes
#Genetic algorithm might be better for the optimization
#but Differential Evolution is good for now
library(DEoptim)
#pick large differential weighting factor (F) ...
#... to get out of local minimas due to rounding
res=DEoptim(fn=total_time,
lower=rep(1,nrow(df)),
upper=rep(cluster_size,nrow(df)),
expected_job_length=expected_job_length,
control=DEoptim.control(CR=0.85,F=1.5,trace=FALSE))
#wait for a minute or two ...
#inspect optimal solution
time_per_node=sapply(split(expected_job_length,
unname(round(res$optim$bestmem))),sum)
time_per_node
# 1 2 3 4
#10.91765 10.94893 10.94069 10.94246
plot(time_per_node,ylim=c(0,15))
abline(h=max(time_per_node),lty=2)
#add node-mapping to df
df$node_id=unname(round(res$optim$bestmem))
现在是在集群上进行计算的时候了:
#start cluster
workers=parallel::makeCluster(cluster_size)
start_time=Sys.time()
#distribute jobs according to optimal node-mapping
clusterApply(workers,split(df,df$node_id),function(x) {
for (i in seq_along(x$job_id)) {
#use tryCatch to do the error handling for jobs that fail
tryCatch({Sys.sleep(x[i,"real_job_length"])},
error=function(err) {print("Do your error handling")})
}
})
end_time=Sys.time()
#how long did it take
end_time-start_time
#Time difference of 11.12532 secs
#add to plot
abline(h=as.numeric(end_time-start_time),col="red",lty=2)
stopCluster(workers)