java - What would cause Hadoop to skip the sort step?

Question

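I'm trying to use Hadoop to format and sort a very large dataset, but it seems to be skipping the sort step. The mapper converts Avro input files into JSON containing a few interesting fields.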

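// Old (org.apache.hadoop.mapred) API: emits the month as the key and a JSON string as the value.
public void map(AvroWrapper<Datum> wrappedAvroDatum, NullWritable nothing,
                OutputCollector<Text, Text> collector, Reporter reporter)
        throws java.io.IOException {
    Datum datum = wrappedAvroDatum.datum();
    if (interesting(datum)) {
        Long time = changeTimeZone(datum.getTime());
        // Zero-padded month, so every key has the same width.
        String key = String.format("%02d", month(time));
        // "time" is written first so that the text of the value starts with the timestamp.
        String value = String.format("{\"time\": %d, \"other-stuff\": %s, ...}", time, datum.getOtherStuff());
        collector.collect(new Text(key), new Text(value));
    }
}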

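The reducer assumes that the values for each key arrive in lexicographic order (that is the ordering for org.apache.hadoop.io.Text, right?) and simply drops the key, so that I get a text file with one JSON object per line.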

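// Drops the key and writes each value on its own line.
public void reduce(Text key, java.util.Iterator<Text> values,
                   OutputCollector<NullWritable, Text> collector, Reporter reporter)
        throws java.io.IOException {
    while (values.hasNext()) {
        // Copy the value, since Hadoop reuses the same Text instance across iterations.
        collector.collect(NullWritable.get(), new Text(values.next()));
    }
}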

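I expected to get text files sorted in month-sized chunks (that is, I don't expect the months themselves to be in order, but I do expect the times within each month to be in order). What I get instead are text files that are grouped by month but otherwise completely unsorted. Apparently, Hadoop is grouping the records by the Text value of the key but not sorting them.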

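(Known issue: I'm relying on the fact that "time" comes first in my JSON objects and that it has exactly the same number of digits in every record, so that lexicographic order is numeric order. This is true of my data.)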
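To make the caveat above concrete, here is a minimal sketch of why the equal-width assumption matters. It uses plain Java String ordering as a stand-in for Text's byte-wise comparison (the two agree for ASCII digits); the class name, method name, and sample values are hypothetical.

```java
import java.util.Arrays;

class LexicographicOrderSketch {
    // Returns a sorted copy using String's natural (lexicographic) ordering.
    static String[] sortedCopy(String... values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Equal digit counts: lexicographic order matches numeric order.
        System.out.println(Arrays.toString(sortedCopy("1350", "0900", "1200")));
        // Differing digit counts: "900" sorts after "1350", which is numerically wrong.
        System.out.println(Arrays.toString(sortedCopy("1350", "900", "1200")));
    }
}
```

The first call yields the timestamps in numeric order; the second does not, because '9' compares greater than '1' character by character.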

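When I used Hadoop Streaming (not an option for this project), the lines of text were sorted automatically. The sorting could be configured, but by default it did what I wanted. In raw Hadoop, does sorting need to be turned on somehow? If so, how? If it is supposed to be on by default, where can I start looking to debug this problem?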

I have observed this behavior in pseudo-distributed mode with Cloudera's CDH4 Hadoop-0.20 package and on Amazon's Elastic Map-Reduce (EMR).

score 2 · Accepted Answer

Hadoop sorts the keys, not the values. This means the results you are getting are correct. Hadoop has not skipped the sort phase; it is actually sorting the keys.

You could design your own Writable type to use a composite key and ensure the type of sorting you want. This other SO question explains how to do this.
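As a rough illustration of the composite-key idea, the sketch below simulates it in plain Java without any Hadoop classes: the key carries both the month and the time, records are ordered by month and then by time, and the "reducer" drops the key. All class, field, and method names here are hypothetical. In a real job with the old API, the key would be a WritableComparable, and you would most likely also need `JobConf.setPartitionerClass` and `JobConf.setOutputValueGroupingComparator` so that partitioning and grouping look at the month only; that wiring is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of a secondary sort; this is not the Hadoop API.
class SecondarySortSketch {

    // Hypothetical composite key: the grouping field (month) plus the sort field (time).
    static final class MonthTimeKey {
        final String month;
        final long time;

        MonthTimeKey(String month, long time) {
            this.month = month;
            this.time = time;
        }
    }

    // One simulated map output record: composite key plus the JSON value.
    static final class Record {
        final MonthTimeKey key;
        final String json;

        Record(String month, long time, String json) {
            this.key = new MonthTimeKey(month, time);
            this.json = json;
        }
    }

    // Sort order: month first, then time. In Hadoop this would live in the key's compareTo.
    static final Comparator<Record> SORT_ORDER =
            Comparator.comparing((Record r) -> r.key.month)
                      .thenComparingLong(r -> r.key.time);

    // Mimics shuffle and reduce: sort by composite key, then emit the values with the key dropped.
    static List<String> shuffleAndReduce(List<Record> mapOutput) {
        List<Record> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT_ORDER);
        List<String> lines = new ArrayList<>();
        for (Record r : sorted) {
            lines.add(r.json);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Record> mapOutput = Arrays.asList(
                new Record("02", 300L, "{\"time\": 300}"),
                new Record("01", 200L, "{\"time\": 200}"),
                new Record("02", 100L, "{\"time\": 100}"),
                new Record("01", 150L, "{\"time\": 150}"));
        for (String line : shuffleAndReduce(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

Because the time is part of the key, the framework's key sort does the work that the original reducer was expecting from the values.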

Finally, this other SO question gives more information on how the shuffle & sort phase works in Hadoop.
