hadoop - Why does Hadoop TeraSort not set a mapper/reducer?

Question

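I am planning to insert some code into the mapper of the TeraSort class in Hadoop 0.20.2. However, after going through the source code, I cannot find the part where the mapper is implemented. Normally there is a call to job.setMapperClass() that specifies the mapper class. For TeraSort, however, I can only see calls such as setInputFormat and setOutputFormat. I cannot find where the map and reduce methods are invoked. Could anyone give some hints? Thanks. The source code looks like this: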

public int run(String[] args) throws Exception {
   LOG.info("starting");
   JobConf job = (JobConf) getConf();
   Path inputDir = new Path(args[0]);
   inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
   Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
   URI partitionUri = new URI(partitionFile.toString() +
                           "#" + TeraInputFormat.PARTITION_FILENAME);
   TeraInputFormat.setInputPaths(job, new Path(args[0]));
   FileOutputFormat.setOutputPath(job, new Path(args[1]));
   job.setJobName("TeraSort");
   job.setJarByClass(TeraSort.class);
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(Text.class);
   job.setInputFormat(TeraInputFormat.class);
   job.setOutputFormat(TeraOutputFormat.class);
   job.setPartitionerClass(TotalOrderPartitioner.class);
   TeraInputFormat.writePartitionFile(job, partitionFile);
   DistributedCache.addCacheFile(partitionUri, job);
   DistributedCache.createSymlink(job);
   job.setInt("dfs.replication", 1);
   // TeraOutputFormat.setFinalSync(job, true);
   job.setNumReduceTasks(0);
   JobClient.runJob(job);
   LOG.info("done");
   return 0;
 }

For other classes, such as TeraValidate, we can find code like this:

job.setMapperClass(ValidateMapper.class);
job.setReducerClass(ValidateReducer.class);

I cannot see any such calls for TeraSort.

Thanks,

score 3 · Accepted Answer

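Why would a sort need Mapper and Reducer classes set for it?

The defaults are the standard Mapper (formerly the identity mapper) and the standard Reducer. These are the classes you normally inherit from.

You could basically say that you just emit everything from the input unchanged and let Hadoop do its own sorting work. So sorting is the "default" behavior.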
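To make that concrete, here is a minimal plain-Java sketch (no Hadoop dependency) of what an identity map, the framework's sort during the shuffle, and an identity reduce amount to. With the old JobConf API, leaving out setMapperClass() and setReducerClass() should, as far as I know, fall back to IdentityMapper and IdentityReducer. The class name IdentitySortSketch and its methods are made up for illustration and are not part of Hadoop; the TreeMap is only a stand-in for the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free simulation of a job whose map and reduce are both the identity.
class IdentitySortSketch {

    // Identity map: emit every input (key, value) record unchanged.
    static List<String[]> map(List<String[]> records) {
        return new ArrayList<>(records);
    }

    // Stand-in for the shuffle: group values by key, with keys kept in sorted order.
    static TreeMap<String, List<String>> shuffle(List<String[]> mapped) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : mapped) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // Identity reduce: write each (key, value) pair back out exactly as received.
    static List<String> reduce(TreeMap<String, List<String>> grouped) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            for (String v : e.getValue()) {
                out.add(e.getKey() + "\t" + v);
            }
        }
        return out;
    }

    static List<String> run(List<String[]> records) {
        return reduce(shuffle(map(records)));
    }

    public static void main(String[] args) {
        List<String[]> input = new ArrayList<>();
        input.add(new String[] {"pear", "1"});
        input.add(new String[] {"apple", "2"});
        input.add(new String[] {"fig", "3"});
        for (String line : run(input)) {
            System.out.println(line);
        }
    }
}
```

Neither map nor reduce contains any sorting logic; the ordering comes entirely from the step between them, which is why the job setup does not need to name a mapper or reducer.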

score 1 · Accepted Answer

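Thomas's answer is correct: the mapper and reducer are the identity, because the shuffled data is already sorted before your reduce function is applied. What is special about TeraSort is its custom partitioner (it is not the default hash function). You should read more about it here: Hadoop's implementation for Terasort. It states: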

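“TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N - 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i - 1] <= key < sample[i] are sent to reduce i. This guarantees that the outputs of reduce i are all less than the outputs of reduce i+1.”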
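The rule in that quote can be sketched in a few lines of plain Java. This is only an illustration under assumptions: it uses a simple binary search over the sorted sample keys, which is not necessarily how Hadoop's own partitioner does the lookup, and the class name RangePartitionSketch is hypothetical. It also assumes the sample keys are distinct and already sorted.

```java
import java.util.Arrays;

// Hypothetical range partitioner: maps a key to the reduce whose key range contains it.
class RangePartitionSketch {
    // Sorted, distinct split points; N - 1 of them for N reduces.
    private final String[] samples;

    RangePartitionSketch(String[] sortedSamples) {
        this.samples = sortedSamples.clone();
    }

    // Returns i such that samples[i - 1] <= key < samples[i],
    // treating the missing lower and upper ends as unbounded.
    int getPartition(String key) {
        int pos = Arrays.binarySearch(samples, key);
        // Exact match at pos means key == samples[pos], so the key belongs to reduce pos + 1.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and insertionPoint is the reduce.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        // Two split points give three reduces: [.., "g"), ["g", "p"), ["p", ..).
        RangePartitionSketch p = new RangePartitionSketch(new String[] {"g", "p"});
        for (String key : new String[] {"apple", "g", "kiwi", "p", "zebra"}) {
            System.out.println(key + " -> reduce " + p.getPartition(key));
        }
    }
}
```

Because every key sent to reduce i sorts before every key sent to reduce i+1, and each reduce receives its own keys in sorted order, concatenating the reduce outputs in order yields a globally sorted result.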
