
I have three sets of data separated by type, and usually each uid has only a few hundred tuples. But (possibly due to some bug) a few uids have as many as 200,000–300,000 rows of data.

When a single bag contains too many tuples, StuffProcessor sometimes throws a heap space error. How should I solve this? Could I somehow check whether a single uid has, say, 100,000+ tuples, and then split the data into smaller batches?

I'm completely new to Pig and barely know what I'm doing.

-- Create a union of the three stuff relations
stuff = UNION stuff1, stuff2, stuff3;

-- Group the data by uid
stuffGrouped = GROUP stuff BY (long)$0;

-- Process the data
processedStuff = FOREACH stuffGrouped GENERATE StuffProcessor(stuff);

-- Flatten the uid groups into a single relation
flatProcessedStuff = FOREACH processedStuff GENERATE FLATTEN($0);

-- Separate into different datasets by type; these are all schemaless
processedStuff1 = FILTER flatProcessedStuff BY (int)$5 == 9;
processedStuff2 = FILTER flatProcessedStuff BY (int)$5 == 17;
processedStuff3 = FILTER flatProcessedStuff BY (int)$5 == 20;

-- Store each dataset into a separate file in HDFS
STORE processedStuff1 INTO '$PROCESSING_DIR/stuff1.txt';
STORE processedStuff2 INTO '$PROCESSING_DIR/stuff2.txt';
STORE processedStuff3 INTO '$PROCESSING_DIR/stuff3.txt';

The Cloudera cluster should have 4 GB of heap space allocated.

This may actually be related to the Cloudera user account, because I can't reproduce the problem under certain users (the piggy user vs. the hdfs user).


1 Answer


If your UDF doesn't actually need to see all of a key's tuples at once, you may want to implement the Accumulator interface so that they are processed in smaller batches. You could also consider implementing the Algebraic interface to speed up the job.
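The idea behind the Accumulator pattern can be sketched in plain Java. This is a self-contained illustration, not a real Pig UDF: the interface below is a local stand-in for `org.apache.pig.Accumulator<T>` (whose `accumulate` actually takes a Pig `Tuple` wrapping a bag and throws `IOException`), and a real UDF would also extend `EvalFunc<T>`.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in for org.apache.pig.Accumulator<T>, so the sketch compiles
// without Pig on the classpath.
interface Accumulator<T> {
    void accumulate(List<Object> batch); // Pig delivers the bag in batches
    T getValue();                        // final result after the last batch
    void cleanup();                      // reset state between keys
}

// Counts a key's tuples without ever materializing the whole bag in memory:
// heap usage stays bounded even for a uid with 300,000 rows, because only
// one small batch is held at a time.
class BatchedCount implements Accumulator<Long> {
    private long count = 0;

    @Override
    public void accumulate(List<Object> batch) {
        count += batch.size();
    }

    @Override
    public Long getValue() {
        return count;
    }

    @Override
    public void cleanup() {
        count = 0;
    }
}

public class AccumulatorSketch {
    public static void main(String[] args) {
        BatchedCount counter = new BatchedCount();
        // Simulate Pig delivering one uid's bag in two batches.
        counter.accumulate(Arrays.asList("t1", "t2", "t3"));
        counter.accumulate(Arrays.asList("t4", "t5"));
        System.out.println(counter.getValue()); // prints 5
        counter.cleanup();                      // ready for the next key
    }
}
```

Pig only calls the accumulator path when every UDF in the `FOREACH` supports it, so StuffProcessor would need to keep its running state across `accumulate` calls the same way.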

The built-in COUNT is a perfect example; it implements both interfaces.
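The Algebraic interface lets part of the work run in the combiner: initial, intermediate, and final stages each produce or merge partial results. Below is a self-contained sketch of that three-stage idea only; the real `org.apache.pig.Algebraic` interface instead exposes `getInitial()`, `getIntermed()`, and `getFinal()`, which return the class names of inner `EvalFunc`s (see Pig's COUNT source for the canonical version).

```java
import java.util.Arrays;
import java.util.List;

// Combiner-style counting: each map task emits partial counts,
// which are merged on the way to the final reduce-side result.
public class AlgebraicCountSketch {
    // "Initial" stage: one input tuple becomes a partial count of 1.
    static long initial(Object tuple) {
        return 1L;
    }

    // "Intermed"/"Final" stages: merge partial counts into a bigger one.
    static long merge(List<Long> partials) {
        return partials.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Two map tasks each pre-aggregate their tuples in the combiner...
        long p1 = merge(Arrays.asList(initial("t1"), initial("t2")));
        long p2 = merge(Arrays.asList(initial("t3")));
        // ...and the reducer merges the partials into the final count.
        System.out.println(merge(Arrays.asList(p1, p2))); // prints 3
    }
}
```

Because partial results are merged associatively, the reducer never needs the full bag for a key, which is exactly what helps with the 300,000-row uids.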

Answered 2013-08-30T20:16:58.467