
I'm using Hadoop to compute co-occurrence similarity between words. I have a file that consists of co-occurring word pairs that looks like:

a b
a c
b c
b d

I'm using a Graph based approach that treats words as nodes and co-occurring words have an edge between them. My algorithm needs to compute the degree of all nodes. I've successfully written a Map-Reduce job to compute the total degree which outputs the following:

a 2
b 3
c 2
d 1
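For illustration, the degree computation is just a per-endpoint count over the pair lines. A minimal plain-Java sketch of that logic (outside Hadoop, with the sample pairs hard-coded; `DegreeCount` is an illustrative name) looks like:

```java
import java.util.HashMap;
import java.util.Map;

public class DegreeCount {
    public static void main(String[] args) {
        // sample co-occurring word pairs, as in the input file
        String[] pairs = {"a b", "a c", "b c", "b d"};

        // count how many edges touch each word (its degree)
        Map<String, Integer> degree = new HashMap<>();
        for (String pair : pairs) {
            for (String word : pair.split(" ")) {
                degree.merge(word, 1, Integer::sum);
            }
        }

        System.out.println(degree);
    }
}
```

In the real job the map phase emits each endpoint with count 1 and the reduce phase sums them; the loop above collapses both steps for a single machine.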

Currently, the output is written back to a file, but what I want instead is to capture the result in, say, a java.util.HashMap. I then want to use this HashMap in another Reduce job to compute the final similarity.

Here are my questions:

  1. Is it possible to capture the results of a reduce job in memory (a List or Map)? If so, how?
  2. Is this the best approach? If not, how should I deal with this?

1 Answer


There are two possibilities: either you read the data in your map/reduce task from the distributed file system, or you add it directly to the distributed cache. I just looked up the distributed cache size, and it can be controlled:

"The local.cache.size parameter controls the size of the DistributedCache. By default, it's set to 10 GB."

Link to the Cloudera blog

So if you add the output of your first job to the distributed cache of the second one, I think it should be fine. Tens of thousands of entries are nowhere near the gigabyte range.

Adding the file to the distributed cache goes like this:

Reading it in your mapper:

// retrieve the local paths of all files placed in the distributed cache
Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration());
// here the degree file is the only (first) cached file
String patternsFile = uris[0].toString();
BufferedReader in = new BufferedReader(new FileReader(patternsFile));
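From there, loading the degree output into a java.util.HashMap is just a matter of splitting each line. A sketch of only the parsing step (fed from an in-memory string here so it runs standalone; `DegreeCacheLoader` and `load` are illustrative names, and in the mapper you would call this from `setup()` with the BufferedReader above):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class DegreeCacheLoader {
    // parse lines of the form "<word> <degree>" into a map;
    // \s+ handles both the tab that TextOutputFormat writes and plain spaces
    static Map<String, Integer> load(BufferedReader in) throws IOException {
        Map<String, Integer> degrees = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            degrees.put(parts[0], Integer.parseInt(parts[1]));
        }
        return degrees;
    }

    public static void main(String[] args) throws IOException {
        // stand-in for the BufferedReader opened over the cached file
        BufferedReader in = new BufferedReader(new StringReader("a 2\nb 3\nc 2\nd 1"));
        System.out.println(load(in));
    }
}
```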

Adding to the DistributedCache, while setting up your second job:

// register the first job's output file before submitting the second job
DistributedCache.addCacheFile(new URI(file), job.getConfiguration());

Let me know if this helps.

answered 2013-10-01T12:03:31.817