
I have Revolution R Enterprise. I want to run 2 simple but computationally intensive operations on each of 121k files in a directory, outputting to new files. I was hoping to use some RevoScaleR function that chunked/parallel-processed the data similarly to lapply. So I'd have lapply(list of files, function), but using a faster RevoScaleR (xdf-based) function that might actually finish, since I suspect basic lapply would never complete.

So is there a RevoScaleR version of lapply? Will running it from Revolution R Enterprise automatically chunk things?

I see parLapply and mclapply (http://www.inside-r.org/r-doc/parallel/clusterApply)... can I run these using cores on the same desktop? On AWS servers? Do I get anything out of running these functions under RevoScaleR if it's not a native xdf function? I guess this is really a question about what I can use as a "cluster" in this situation.
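
For reference, a minimal sketch of the plain base-R version I have in mind; process_file and the CSV read/write are just placeholders standing in for my two operations:

files <- list.files("input_dir", full.names = TRUE)

# placeholder for the two computationally intensive operations
process_file <- function(fname) {
    dat <- read.csv(fname)
    # ... heavy computation on dat ...
    write.csv(dat, file = paste0(fname, ".out.csv"), row.names = FALSE)
}

# serial: processes one file at a time, which is why I doubt it would ever finish
lapply(files, process_file)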


1 Answer


Yes: rxExec behaves like lapply in the single-core scenario, and like parLapply in a multi-core/multi-process scenario. You would use it like this:

# vector of file names to operate on
files <- list.files()

rxSetComputeContext("localpar")
rxExec(function(fname) { 
    ...
}, fname=rxElemArg(files))

Here, the anonymous function is the one that carries out the operations you want on each file; you pass it to rxExec much as you would to lapply. The rxElemArg function tells rxExec to execute that function once for each of the different values in files. Setting the compute context to "localpar" starts up a local cluster of slave processes, so the operations will run in parallel. By default the number of slaves is 4, but you can change this with rxOptions(numCoresToUse).
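
For concreteness, here is a slightly fuller sketch along the same lines; the "input_dir" path, the CSV read/write, and the rowSums step are placeholders standing in for your two operations, not part of the original answer:

library(RevoScaleR)   # loaded automatically in Revolution R Enterprise; shown for completeness

# all input files, with full paths
files <- list.files("input_dir", full.names = TRUE)

rxSetComputeContext("localpar")   # local cluster of worker processes
rxOptions(numCoresToUse = 8)      # optional: raise the default of 4 workers

results <- rxExec(function(fname) {
    dat <- read.csv(fname)                      # assumption: plain CSV input
    dat$total <- rowSums(dat, na.rm = TRUE)     # placeholder for the two real operations
    write.csv(dat, file = paste0(fname, ".out.csv"), row.names = FALSE)
    fname                                       # rxExec collects the return values into a list
}, fname = rxElemArg(files))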

How much of a speedup should you expect? That depends on your data. If your files are small and most of the time is taken up by computation, parallel processing can give you a big speedup. If your files are large, however, you may run into an I/O bottleneck, especially if all the files are on the same hard disk.

answered 2015-10-22 at 09:11