
I am looking for functionality similar to Hadoop's distributed cache in Spark. I need a relatively small data file (containing some index values) to be present on all nodes in order to make some calculations. Is there any approach that makes this possible in Spark?

My workaround so far consists of distributing and reducing the index file as a normal processing step, which takes around 10 seconds in my application. After that, I persist the resulting collection as a broadcast variable, as follows:

// Read the index file and collect it to the driver
JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt", 1);
// collect() returns a List, so copy it rather than casting to ArrayList
ArrayList<String> localIndex = new ArrayList<String>(indexFile.collect());

// Broadcast the collected index so every executor keeps a read-only copy
final Broadcast<ArrayList<String>> globalIndex = ctx.broadcast(localIndex);

With this, every task can read the contents of globalIndex. So far this patch works for me, but I don't consider it the best solution. Would it still be effective with a considerably bigger data set or a large number of variables?
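For reference, the tasks then read the broadcast copy through value(). A minimal sketch, where dataRdd is a placeholder for whatever RDD the calculation actually runs over:

import org.apache.spark.api.java.function.Function;

// Each task calls value() to get the read-only copy cached on its executor
JavaRDD<String> matched = dataRdd.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String record) throws Exception {
        return globalIndex.value().contains(record);
    }
});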

Note: I am using Spark 1.0.0 on a standalone cluster running on several EC2 instances.

2 Answers

Have a look at the SparkContext.addFile() method. I guess that is what you are looking for.
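A sketch of how that could look, assuming the ctx and bucket path from the question and a hypothetical input RDD named records:

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.function.Function;

// Ship the index file to every worker node once, when the job is submitted
ctx.addFile("s3n://mybucket/input/indexFile.txt");

JavaRDD<String> results = records.map(new Function<String, String>() {
    @Override
    public String call(String record) throws Exception {
        // SparkFiles.get resolves the node-local copy of the shipped file
        String localPath = SparkFiles.get("indexFile.txt");
        // ... open localPath with plain java.io and use the index here ...
        return record;
    }
});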

answered 2016-02-19T00:49:19.987

As long as we use broadcast variables, it should be effective for larger datasets as well.

From the Spark documentation: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient way."
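The programming guide illustrates this with a short snippet; assuming sc is the JavaSparkContext, the Java version looks roughly like this:

// Create a broadcast variable from a small array on the driver
Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});

// Read back the cached value on any node
broadcastVar.value();  // returns [1, 2, 3]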

answered 2015-01-28T13:19:28.150