hadoop - How do I emit a key-value pair once the whole file has been processed?

Question

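My mapper reads a file line by line. How can I emit a key-value pair once, after the whole file has been scanned, rather than after every line?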

score 2 · Accepted Answer

With the new mapreduce API, you can override the Mapper.cleanup(Context) method and call Context.write(K, V) there, just as you normally would in the map method.

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  // Runs once per map task, after the last map() call
  context.write(new Text("key"), new Text("value"));
}
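The usual way to apply this is to accumulate state in map() and write it once in cleanup(). The Hadoop-free sketch below models that lifecycle in plain Java so it can be run without a cluster. Sink, LineCountingMapper and run are hypothetical stand-ins invented for illustration, not Hadoop classes, and the line counter is just an example of per-file state.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanupEmitSketch {

    /** Hypothetical stand-in for Hadoop's Context: it only collects emitted pairs. */
    static final class Sink {
        final List<String> out = new ArrayList<>();

        void write(String key, long value) {
            out.add(key + "\t" + value);
        }
    }

    /** Mimics the mapper lifecycle: map() once per line, cleanup() once at the end. */
    static final class LineCountingMapper {
        private long lines = 0;

        void map(String line, Sink sink) {
            lines++; // accumulate only, emit nothing per line
        }

        void cleanup(Sink sink) {
            sink.write("lines", lines); // the single end-of-input record
        }
    }

    /** Drives the lifecycle the way a framework would: every record, then cleanup. */
    static List<String> run(List<String> input) {
        Sink sink = new Sink();
        LineCountingMapper mapper = new LineCountingMapper();
        for (String line : input) {
            mapper.map(line, sink);
        }
        mapper.cleanup(sink);
        return sink.out;
    }

    public static void main(String[] args) {
        System.out.println(run(java.util.Arrays.asList("a", "b", "c")));
    }
}
```

In a real job the counter would be a field of your Mapper subclass. The record is emitted once per map task, so it covers the whole file only if the file is not split across several mappers.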

With the old mapred API you can override close() instead, but you need to keep a reference to the OutputCollector that is passed to the map method:

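private OutputCollector cachedCollector = null;

public void map(LongWritable key, Text value, OutputCollector outputCollector, Reporter reporter) throws IOException {
  // Remember the collector so close() can still write to it
  if (cachedCollector == null) {
    cachedCollector = outputCollector;
  }

  // ...
}

@Override
public void close() throws IOException {
  // map() never runs for an empty input, so the collector may still be null
  if (cachedCollector != null) {
    cachedCollector.collect(outputKey, outputValue);
  }
}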

score 0 · Accepted Answer

As an alternative to Chris's answer, you can also achieve this by overriding the run() method of the Mapper class (new API):

public static class Map extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {

  // map method here

  // Override run() to emit one extra record after the last map() call
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    // Emit your last <key, value> here
    context.write(lastOutputKey, lastOutputValue);
    cleanup(context);
  }
}

To make sure each mapper processes exactly one whole file, you have to create your own version of FileInputFormat and override isSplitable(), like this:

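// New-API variant, to match the Mapper above. TextInputFormat is a FileInputFormat
// subclass, so extending it keeps the default line-by-line record reader.
public class NonSplittableFileInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // never split: one mapper per file
    }
}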

score 0 · Accepted Answer

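Do you have one key-value pair for the whole file, or for multiple files?

Case #1: use WholeFileInputFormat. You receive the complete file contents as a single record. You can split it into records, process them all, and emit the final key/value when processing ends.

Case #2: use the same file input format. Keep all the key-values in temporary storage. At the end, go through that storage, emit whichever key/values you want, and suppress the ones you don't.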
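The buffering idea in case #2 can be sketched in plain Java, independent of Hadoop. This is only an illustration under assumptions: BufferedEmitSketch, process and the minCount threshold are hypothetical names, the "temporary storage" is assumed to be an in-memory map, and the suppression rule (drop words seen fewer than minCount times) is made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BufferedEmitSketch {

    /**
     * Buffers a count per word while reading the lines, then emits only the
     * entries whose count reaches minCount once all lines have been seen.
     */
    static List<String> process(List<String> lines, int minCount) {
        // Temporary storage; LinkedHashMap keeps first-seen order for stable output
        Map<String, Integer> buffer = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    buffer.merge(word, 1, Integer::sum);
                }
            }
        }

        // End of input: emit what is wanted, suppress the rest
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : buffer.entrySet()) {
            if (entry.getValue() >= minCount) {
                emitted.add(entry.getKey() + "=" + entry.getValue());
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(process(java.util.Arrays.asList("a b a", "b c a"), 2));
    }
}
```

In a mapper, the first loop would live in map() and the second in cleanup(). An in-memory buffer like this only works while the buffered keys fit in the task's heap.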
