java - Joining two files with Cascading is very slow

Question

I am using Cascading to HashJoin two 300MB files. I run the following Cascading workflow:

// select the field which I need from the first file
Fields f1 = new Fields("id_1");
docPipe1 = new Each( docPipe1, scrubArguments, new ScrubFunction( f1 ), Fields.RESULTS );   

// select the fields which I need from the second file 
Fields f2 = new Fields("id_2","category");
docPipe2 = new Each( docPipe2, scrubArguments, new ScrubFunction( f2), Fields.RESULTS ); 

// hashJoin
Pipe tokenPipe = new HashJoin( docPipe1, new Fields("id_1"), 
                     docPipe2, new Fields("id_2"), new LeftJoin());

// count the number of each "category" based on the id_1 matching id_2
Pipe pipe = new Pipe(tokenPipe );
pipe = new GroupBy( pipe , new Fields("category"));
pipe = new Every( pipe, Fields.ALL, new Count(), Fields.ALL );

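I run this Cascading program on a Hadoop cluster with 3 data nodes, each with 8 GB of RAM and 4 cores (I set mapred.child.java.opts to 4096MB), but it takes about 30 minutes to get the final result. That seems too slow to me, yet I cannot find anything wrong with my program or the cluster. How can I make this Cascading join faster?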

score 3 · Accepted Answer

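As stated in the Cascading User Guide:

HashJoin attempts to keep the entire right-hand stream in memory for rapid comparison (not just the current grouping, as no grouping is performed for a HashJoin). Thus a very large tuple stream in the right-hand stream may exceed a configurable spill-to-disk threshold, reducing performance and potentially causing a memory error. For this reason, it is advisable to use the smaller stream on the right-hand side.

Or

use a CoGroup, which may help.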
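
To make the user guide's point concrete, here is a minimal plain-Java sketch of the idea behind a hash join, followed by the per-category count from the question. It is not Cascading's implementation: the class name HashJoinSketch, the method countByCategory, and the "(unmatched)" label are all made up for illustration. The point it shows is that the whole right-hand input is loaded into a map before the left-hand input is streamed, so memory use grows with the size of the right side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, not Cascading's HashJoin implementation.
class HashJoinSketch {

    // Hypothetical label for left rows that have no match on the right.
    static final String UNMATCHED = "(unmatched)";

    // Left-joins leftIds against (id, category) rows and counts joined rows per category.
    static Map<String, Long> countByCategory(List<String> leftIds, List<String[]> rightRows) {
        // Build phase: the entire right side is held in memory, keyed by id.
        Map<String, List<String>> rightById = new HashMap<>();
        for (String[] row : rightRows) {
            rightById.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        // Probe phase: the left side is streamed one id at a time.
        Map<String, Long> counts = new HashMap<>();
        for (String id : leftIds) {
            List<String> categories = rightById.get(id);
            if (categories == null) {
                // A left join keeps left rows that have no match.
                counts.merge(UNMATCHED, 1L, Long::sum);
            } else {
                for (String category : categories) {
                    counts.merge(category, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> leftIds = Arrays.asList("1", "2", "3", "4");
        List<String[]> rightRows = Arrays.asList(
                new String[] {"1", "books"},
                new String[] {"2", "books"},
                new String[] {"3", "music"});
        System.out.println(countByCategory(leftIds, rightRows));
    }
}
```

In this sketch only rightById scales with input size, which is why the guide recommends putting the smaller stream on the right.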

score 0 · Accepted Answer

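Your Hadoop cluster may be busy or dedicated to other jobs, which could explain the time taken. I don't think replacing HashJoin with CoGroup will help, because CoGroup is a reduce-side join whereas HashJoin is a map-side join, so HashJoin should perform better than CoGroup. I think you should try again on a less busy cluster, since your code looks fine.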
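
For contrast with the map-side approach, the following plain-Java sketch imitates what a reduce-side join does: rows from both inputs are first routed to a group by key, and the join happens inside each group. It is only an illustration under assumed names (ReduceSideJoinSketch, Group, leftJoin); in a real Hadoop job the grouping step is the shuffle, which moves data across the network and disk, and that cost is the reason a map-side join is usually expected to be faster when the right side fits in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of a reduce-side join, not Cascading's CoGroup.
class ReduceSideJoinSketch {

    // One key's worth of rows from both inputs, as a reducer would receive them.
    static final class Group {
        final List<String> leftIds = new ArrayList<>();
        final List<String> categories = new ArrayList<>();
    }

    // Left-joins leftIds against (id, category) rows; returns "id<TAB>category" lines.
    static List<String> leftJoin(List<String> leftIds, List<String[]> rightRows) {
        // "Shuffle": send every row from both inputs to the group for its key.
        Map<String, Group> groups = new TreeMap<>();
        for (String id : leftIds) {
            groups.computeIfAbsent(id, k -> new Group()).leftIds.add(id);
        }
        for (String[] row : rightRows) {
            groups.computeIfAbsent(row[0], k -> new Group()).categories.add(row[1]);
        }

        // "Reduce": join within each key group, one group at a time.
        List<String> joined = new ArrayList<>();
        for (Group g : groups.values()) {
            for (String id : g.leftIds) {
                if (g.categories.isEmpty()) {
                    joined.add(id + "\tnull"); // left join keeps unmatched left rows
                } else {
                    for (String category : g.categories) {
                        joined.add(id + "\t" + category);
                    }
                }
            }
        }
        return joined;
    }
}
```

Here neither input has to fit in memory as a whole, only one key group at a time, which is the trade-off the first answer points to when it suggests CoGroup for a large right-hand stream.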
