r - Further subsetting a data frame for embarrassingly parallel processing

Question

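I have an embarrassingly parallel problem that I am handling with the snowfall package and its sfLapply function. It works well, except that I need a better way to split up the work. The data frame I pass in looks like this: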

Group          Date
1            02/01/12
4            02/01/12
...          ...(31 items)
13           02/01/13
4            02/18/13
5            02/18/13
...          ...(9 items)
22           02/18/13

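and I need to split it into processing groups by date. The trouble is that there are only about 5 distinct dates, so simply using

split(processing.groups, processing.groups$Date)

yields too few parallel jobs. What I would like is an elegant way to get a list in which each element contains no more than 20 entries, all guaranteed to share the same date.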

Example:

List Elem 1:  20 items
1             02/01/12
4             02/01/12
...           ...
9             02/01/12
List Elem 2:  14 items
99            02/01/12
17            02/01/12
...           ...
13            02/01/12
List Elem 3:  11 items
4             02/18/13
5             02/18/13
...           ...
22            02/18/13

It feels like some clever listy-cutty-splitty syntax should be able to do this neatly. Any suggestions?

score 1 · Accepted Answer

I'm not sure this is elegant, but...

# Set up a dummy data frame
z <- data.frame(group=1:200, date=sample(c("a","b","c","d"),200,replace=TRUE))

splitz <- split(z, z$date) # split it once, by date
newsplit <- list() # create something to collect the results in
# split the already-split pieces into chunks of <= 20 rows (the first chunk of each date gets 19)
twicesplit <- sapply(splitz, FUN= function(x){
    newsplit <<- c(newsplit, split(x, findInterval(seq_len(nrow(x)), (1:20)*20)))
    # the breakpoints (1:20)*20 stop at 400, so `1:20` would have to be longer
    # if more than about 400 observations shared the same date
})
rm(twicesplit) # clean up the throwaway variable, used only to suppress printing
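
The breakpoints above are fixed at 20, 40, ..., 400, so the approach depends on no date having more than roughly 400 rows. A minimal sketch that avoids a fixed limit is below. It assumes chunk labels computed as ceiling(seq_len(n) / max_rows); the helper name split_by_date_chunks and its arguments are hypothetical, not part of either answer. The checks at the end confirm the two properties the question asks for.

```r
# Hypothetical helper (name and arguments are illustrative only):
# split `df` by `date_col`, then cut each date's rows into chunks of <= max_rows.
split_by_date_chunks <- function(df, date_col = "date", max_rows = 20) {
  by_date <- split(df, df[[date_col]], drop = TRUE)
  chunks <- lapply(by_date, function(x) {
    # rows 1..max_rows get label 1, the next max_rows get label 2, and so on
    split(x, ceiling(seq_len(nrow(x)) / max_rows))
  })
  # Flatten one level: a list of per-date lists becomes one list of data frames
  unlist(chunks, recursive = FALSE)
}

set.seed(1)  # reproducible dummy data
z <- data.frame(group = 1:200,
                date = sample(c("a", "b", "c", "d"), 200, replace = TRUE))
newsplit <- split_by_date_chunks(z)

# Properties the question asks for
chunk_sizes <- vapply(newsplit, nrow, integer(1))
dates_per_chunk <- vapply(newsplit, function(x) length(unique(x$date)), integer(1))
stopifnot(all(chunk_sizes <= 20),       # no chunk exceeds 20 rows
          all(dates_per_chunk == 1),    # every chunk holds a single date
          sum(chunk_sizes) == nrow(z))  # no rows lost or duplicated
```

The resulting list can be passed to sfLapply in place of the output of the single split by date.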

score 1 · Accepted Answer

Here is one way:

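# dummy data: 34 rows on the first date, 11 on the second
mydf <- data.frame( Group= sample(45, 45),
  Date = rep( c('02/01/12', '02/18/13'), c(34, 11) ) )

# within each date, number the rows in blocks of 20: 1,1,...,1,2,2,...
tmp <- ave( mydf$Group, mydf$Date,
    FUN=function(x) rep( seq( ceiling(length(x)/20) ),
    each=20, length.out=length(x) ) )

# split on the (block number, date) combination, dropping unused combinations
outlist <- split( mydf, interaction(tmp, mydf$Date, drop=TRUE) )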
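
To see why this yields the 20 / 14 / 11 layout from the question, the sketch below isolates the block-numbering step and then reruns the split. The helper name chunk_ids is hypothetical; its body is equivalent to the FUN passed to ave() above, with seq_len used in place of seq.

```r
# Block labels for one date group of n rows, at most 20 rows per block
# (hypothetical helper, equivalent to the FUN passed to ave() above)
chunk_ids <- function(n) rep(seq_len(ceiling(n / 20)), each = 20, length.out = n)

table(chunk_ids(34))  # block 1 has 20 rows, block 2 has 14
table(chunk_ids(11))  # a single block of 11 rows

# Same data layout as the answer: 34 rows on one date, 11 on the other
mydf <- data.frame(Group = sample(45, 45),
                   Date = rep(c("02/01/12", "02/18/13"), c(34, 11)))
tmp <- ave(mydf$Group, mydf$Date,
           FUN = function(x) chunk_ids(length(x)))
outlist <- split(mydf, interaction(tmp, mydf$Date, drop = TRUE))

sizes <- vapply(outlist, nrow, integer(1))
sort(unname(sizes))  # 11 14 20
```

Because the block number restarts at 1 for every date, splitting on the block number alone would mix dates; combining it with the date through interaction() keeps each list element to a single date.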
