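I am using MultipleOutputs in my reducer during the reduce phase. The dataset I am working on is about 270 MB, and I am running it on my pseudo-distributed single node. I use a custom Writable for my map output values. The keys are the countries present in the dataset.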
import java.io.IOException;
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class reduce_class extends Reducer<Text, name, NullWritable, Text> {
    @Override
    public void reduce(Text key, Iterable<name> values, Context context)
            throws IOException, InterruptedException {
        // A new MultipleOutputs is created (and closed) on every reduce() call.
        MultipleOutputs<NullWritable, Text> m = new MultipleOutputs<NullWritable, Text>(context);
        NullWritable out = NullWritable.get();
        // Buffer every name for this country in memory, grouped and sorted by patent number.
        TreeMap<Long, ArrayList<String>> map = new TreeMap<Long, ArrayList<String>>();
        for (name nn : values) {
            long pat = nn.patent_No.get();
            if (!map.containsKey(pat)) {
                map.put(pat, new ArrayList<String>());
            }
            map.get(pat).add(nn.getName().toString());
        }
        // Write one block per patent number to a file named after the country key.
        for (Map.Entry<Long, ArrayList<String>> entry : map.entrySet()) {
            m.write(out, new Text("--------------------------"), key.toString());
            m.write(out, new Text(entry.getKey().toString()), key.toString());
            for (String n : entry.getValue()) {
                m.write(out, new Text(n), key.toString());
            }
            m.write(out, new Text("--------------------------"), key.toString());
        }
        m.close();
    }
}
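The above is my reduce logic.
Problems
1) The above code works fine for a small dataset, but fails with a Java heap space error for the 270 MB dataset.
2) Using the country as the key passes rather large groups of values in a single Iterable. I tried to work around this, but MultipleOutputs creates a unique file for a given set of keys. The point is that I cannot append to an existing file created by a previous reduce call; it throws an error. So for each particular key I have to create a new file. Is there a way to work around this? Avoiding that error is what led me to define the key as the country name (my final sorted data), but that throws the Java heap error.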
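For reference, one direction that might reduce heap pressure is to avoid buffering a whole country in a TreeMap and instead emit each patent block as the values stream past. The sketch below is plain Java, not Hadoop code, and it rests on an assumption the original job does not satisfy as written: the values for a country must already arrive sorted by patent number (for example through a secondary sort). `Inventor` and `LineSink` are hypothetical stand-ins for the custom Writable and for the `MultipleOutputs.write` call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class StreamingGroupSketch {
    /** Hypothetical stand-in for the custom Writable: patent number plus a name. */
    static final class Inventor {
        final long patentNo;
        final String name;
        Inventor(long patentNo, String name) {
            this.patentNo = patentNo;
            this.name = name;
        }
    }

    /** Hypothetical sink; in a reducer this is where MultipleOutputs.write would go. */
    interface LineSink {
        void write(String line);
    }

    static final String SEPARATOR = "--------------------------";

    /**
     * Emits one separator-delimited block per patent number while keeping
     * only the current patent number in memory.
     * ASSUMPTION: sortedValues is already ordered by patent number.
     */
    static void emitGrouped(Iterator<Inventor> sortedValues, LineSink sink) {
        boolean open = false;   // true once the first block has been started
        long current = 0L;      // patent number of the block being written
        while (sortedValues.hasNext()) {
            Inventor v = sortedValues.next();
            if (!open || v.patentNo != current) {
                if (open) {
                    sink.write(SEPARATOR);  // close the previous block
                }
                sink.write(SEPARATOR);      // open a new block
                sink.write(Long.toString(v.patentNo));
                current = v.patentNo;
                open = true;
            }
            sink.write(v.name);
        }
        if (open) {
            sink.write(SEPARATOR);          // close the last block
        }
    }

    public static void main(String[] args) {
        List<Inventor> values = Arrays.asList(
                new Inventor(3858241L, "Durand"),
                new Inventor(3858241L, "Norris"),
                new Inventor(3858242L, "Gooding"));
        List<String> out = new ArrayList<>();
        emitGrouped(values.iterator(), out::add);
        out.forEach(System.out::println);
    }
}
```

The block layout matches what the reducer above writes (separator, patent number, names, separator), but memory use no longer grows with the size of the country. Whether this fits depends on being able to sort the values before they reach the reducer.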
Sample input
3858241,"Durand","Philip","E.","","","Hudson","MA","US","",1
3858241,"Norris","Lonnie","H.","","","Milford","MA","US","",2
3858242,"Gooding","Elwyn","R.","","120 Darwin Rd.","Pinckney","MI","US","48169",1
3858243,"Pierron","Claude","Raymond","","","Epinal","","FR","",1
3858243,"Jenny","Jean","Paul","","","Decines","","FR","",2
3858243,"Zuccaro","Robert","","","","Epinal","","FR","",3
3858244,"Mann","Richard","L.","","P.O. Box 69","Woodstock","CT","US","06281",1
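To make the record layout concrete, here is a minimal sketch of how one such line could be split into fields in plain Java. The class name and the `COUNTRY_INDEX` constant are hypothetical, and the reading of column 9 as the country is inferred from the sample rows, not stated anywhere in the job itself. The splitter only toggles on double quotes and does not handle escaped quotes inside a field.

```java
import java.util.ArrayList;
import java.util.List;

final class InventorLineSketch {
    /** ASSUMPTION: the 9th column (index 8) holds the country code, as in the sample rows. */
    static final int COUNTRY_INDEX = 8;

    /** Splits one line on commas that are outside double quotes and strips the quotes. */
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // enter or leave a quoted field
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString());    // field boundary
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());            // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "3858241,\"Durand\",\"Philip\",\"E.\",\"\",\"\",\"Hudson\",\"MA\",\"US\",\"\",1";
        List<String> f = splitCsv(line);
        System.out.println("patent=" + f.get(0) + " country=" + f.get(COUNTRY_INDEX));
    }
}
```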
Sample output for the small dataset
Sample directory structure...
CA-r-00000
FR-r-00000
Quebec-r-00000
TX-r-00000
US-r-00000
*Individual content*
3858241
Philip E. Durand
Lonnie H. Norris
3858242
Elwyn R. Gooding
3858244