
I'm using Hadoop to compute co-occurrence similarity between words. I have a file that consists of co-occurring word pairs that looks like:

a b
a c
b c
b d

I'm using a Graph based approach that treats words as nodes and co-occurring words have an edge between them. My algorithm needs to compute the degree of all nodes. I've successfully written a Map-Reduce job to compute the total degree which outputs the following:

a 2
b 3
c 2
d 1
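For illustration, the degree computation is just a per-endpoint count over the pair lines. A minimal plain-Java sketch of that logic (outside Hadoop, with the sample pairs hard-coded; `DegreeCount` is an illustrative name) looks like:

```java
import java.util.HashMap;
import java.util.Map;

public class DegreeCount {
    public static void main(String[] args) {
        // sample co-occurring word pairs, as in the input file
        String[] pairs = {"a b", "a c", "b c", "b d"};

        // count how many edges touch each word (its degree)
        Map<String, Integer> degree = new HashMap<>();
        for (String pair : pairs) {
            for (String word : pair.split(" ")) {
                degree.merge(word, 1, Integer::sum);
            }
        }

        System.out.println(degree);
    }
}
```

In the real job the map phase emits each endpoint with count 1 and the reduce phase sums them; the loop above collapses both steps for a single machine.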

Currently, the output is written back to a file, but what I want instead is to capture the result in, say, a java.util.HashMap. I then want to use this HashMap in another Reduce job to compute the final similarity.

Here are my questions:

  1. Is it possible to capture the results of a reduce job in memory (a List or Map)? If so, how?
  2. Is this the best approach? If not, how should I deal with this?

1 Answer


There are two possibilities: either you read the data in your map/reduce task from the distributed file system, or you add it directly to the distributed cache. I just looked up the distributed cache size, and it can be controlled:

"The local.cache.size parameter controls the size of the DistributedCache. By default, it's set to 10 GB."

Link to the Cloudera blog

So if you add the output of your first job to the distributed cache of the second one, I think it should be fine. Tens of thousands of entries are nowhere near the gigabyte range.

Adding the file to the distributed cache goes like this:

Reading it in your mapper:

// retrieve the local paths of all files placed in the distributed cache
Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration());
// here the degree file is the only (first) cached file
String patternsFile = uris[0].toString();
BufferedReader in = new BufferedReader(new FileReader(patternsFile));
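From there, loading the degree output into a java.util.HashMap is just a matter of splitting each line. A sketch of only the parsing step (fed from an in-memory string here so it runs standalone; `DegreeCacheLoader` and `load` are illustrative names, and in the mapper you would call this from `setup()` with the BufferedReader above):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class DegreeCacheLoader {
    // parse lines of the form "<word> <degree>" into a map;
    // \s+ handles both the tab that TextOutputFormat writes and plain spaces
    static Map<String, Integer> load(BufferedReader in) throws IOException {
        Map<String, Integer> degrees = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            degrees.put(parts[0], Integer.parseInt(parts[1]));
        }
        return degrees;
    }

    public static void main(String[] args) throws IOException {
        // stand-in for the BufferedReader opened over the cached file
        BufferedReader in = new BufferedReader(new StringReader("a 2\nb 3\nc 2\nd 1"));
        System.out.println(load(in));
    }
}
```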

Adding to the DistributedCache, while setting up your second job:

// register the first job's output file before submitting the second job
DistributedCache.addCacheFile(new URI(file), job.getConfiguration());

Let me know if this helps.

answered 2013-10-01T12:03:31.817