I have three sets of data separated by type, and usually each uid has only a few hundred tuples. However (probably due to some bug), a few uids have as many as 200,000–300,000 rows of data.
When a single bag contains too many tuples, StuffProcessor sometimes throws a heap space error. How should I deal with this? Can I somehow check whether a single uid has, say, 100,000+ tuples and then split the data into smaller batches?
I'm completely new to Pig and barely know what I'm doing.
-- Create union of the three stuffs
stuff = UNION stuff1, stuff2, stuff3;
-- Group data by uid
stuffGrouped = group stuff by (long)$0;
-- Process data
processedStuff = foreach stuffGrouped generate StuffProcessor(stuff);
-- Flatten the UID groups into single table
flatProcessedStuff = foreach processedStuff generate FLATTEN($0);
-- Separate into different datasets by type, these are all schemaless
processedStuff1 = filter flatProcessedStuff by (int)$5 == 9;
processedStuff2 = filter flatProcessedStuff by (int)$5 == 17;
processedStuff3 = filter flatProcessedStuff by (int)$5 == 20;
-- Store everything into separate files into HDFS
store processedStuff1 into '$PROCESSING_DIR/stuff1.txt';
store processedStuff2 into '$PROCESSING_DIR/stuff2.txt';
store processedStuff3 into '$PROCESSING_DIR/stuff3.txt';
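One possible way to do the size check the question asks about is to filter the grouped relation on the bag size before calling the UDF. This is only a sketch under the assumptions already in the script (the uid is `$0`, the grouped bag is named `stuff`); the 100000 cutoff is an illustrative threshold, not a recommended value:

```pig
-- Split the grouped data by bag size so oversized uids never reach StuffProcessor.
-- COUNT returns a long, so compare against a long literal.
smallGroups = filter stuffGrouped by COUNT(stuff) < 100000L;
bigGroups   = filter stuffGrouped by COUNT(stuff) >= 100000L;

-- Process the normal-sized groups as before
processedStuff = foreach smallGroups generate StuffProcessor(stuff);

-- Keep the suspect uids around for inspection instead of crashing on them
store bigGroups into '$PROCESSING_DIR/oversized_uids';
```

Whether the oversized groups should be dropped, stored for later inspection, or chunked and reprocessed depends on whether those 200,000+ row uids are genuinely bad data or just rare-but-valid cases.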
The Cloudera cluster should be allocating 4 GB of heap space.
This may actually be related to the cloudera user, since I can't reproduce the problem under certain users (the piggy user vs. the hdfs user).
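If the 4 GB is the client-side heap, note that the UDF runs inside the MapReduce task JVMs, whose heap is set separately. A hedged sketch of raising it for one script (property name is the MR1-era one; newer YARN clusters use `mapreduce.map.java.opts` / `mapreduce.reduce.java.opts` instead):

```pig
-- At the top of the Pig script: ask for a larger heap in the task JVMs.
-- The 4096m value here just mirrors the 4 GB mentioned above.
set mapred.child.java.opts '-Xmx4096m';
```

This does not fix the underlying skew, but it can tell you whether the failures are purely a task-heap issue.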