hadoop - Unable to group data in the Reducer

Question

I am trying to write a MapReduce application in which the Mapper passes a set of values to the Reducer, like this:

Hello
World
Hello
Hello
World
Hi

These values should first be grouped and counted, and then some further processing is applied. The code I wrote is:

public void reduce(Text key, Iterable<Text> values, Context context) 
        throws IOException, InterruptedException {

    List<String> records = new ArrayList<String>();

    /* Collects all the records from the mapper into the list. */
    for (Text value : values) {
        records.add(value.toString());
    }
    /* Groups the values. */
    Map<String, Integer> groupedData = groupAndCount(records);
    Set<String> groupKeys = groupedData.keySet();

    /* Writes the grouped data. */
    for (String groupKey : groupKeys) {
        System.out.println(groupKey + ": " + groupedData.get(groupKey));
        context.write(NullWritable.get(), new Text(groupKey + groupedData.get(groupKey)));
    }
}

public Map<String, Integer> groupAndCount(List<String> records) {
    Map<String, Integer> groupedData = new HashMap<String, Integer>();
    String currentRecord = "";

    Collections.sort(records);
    for (String record : records) {
        System.out.println(record);

        if (!currentRecord.equals(record)) {
            currentRecord = record;
            groupedData.put(currentRecord, 1);
        } else {
            int currentCount = groupedData.get(currentRecord);
            groupedData.put(currentRecord, ++currentCount);
        }
    }

    return groupedData;
}

But in the output, every count I get is 1. The sysout statements print the following:

Hello
World
Hello: 1
World: 1
Hello
Hello: 1
Hello
World
Hello: 1
World: 1
Hi
Hi: 1

I don't understand what the problem is, or why the Reducer does not receive all of the records at once and pass them to the groupAndCount method.

score 0 · Accepted Answer

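As you note in your comment, if the values are emitted under different keys, they will not be reduced in the same reduce call, and you will get the output you are currently seeing.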

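The fundamental concept behind a Hadoop reducer is that values are collected and reduced per key: all values that share a key are passed to a single reduce call. I suggest you re-read some of the introductory Hadoop documentation, especially the Word Count example, which seems to be roughly what you are trying to achieve with your code.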
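
To make the per-key behaviour concrete, here is a minimal sketch in plain Java that only imitates the shuffle and reduce steps. It does not use the Hadoop API. The class name ShuffleSketch, its methods, and the keys "a" to "d" are hypothetical and chosen for illustration only; the question does not show which keys the Mapper actually emits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {

    /**
     * Imitates shuffle + reduce: values are collected per key, then each key's
     * values are counted in a separate "reduce call".
     * Returns one count map per simulated reduce call.
     */
    static List<Map<String, Integer>> reduceCalls(List<String[]> mapOutput) {
        // Shuffle: group values by key (TreeMap keeps the keys sorted).
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String[] pair : mapOutput) {
            byKey.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        // Reduce: one call per key, counting only the values that call receives.
        List<Map<String, Integer>> calls = new ArrayList<>();
        for (List<String> values : byKey.values()) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
            calls.add(counts);
        }
        return calls;
    }

    /** Hypothetical map output where the words end up under four different keys. */
    static List<String[]> scatteredKeys() {
        return Arrays.asList(
            new String[] {"a", "Hello"}, new String[] {"a", "World"},
            new String[] {"b", "Hello"},
            new String[] {"c", "Hello"}, new String[] {"c", "World"},
            new String[] {"d", "Hi"});
    }

    /** Word Count style map output: the word itself is used as the key. */
    static List<String[]> wordAsKey() {
        List<String[]> out = new ArrayList<>();
        for (String word : new String[] {"Hello", "World", "Hello", "Hello", "World", "Hi"}) {
            out.add(new String[] {word, word});
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [{Hello=1, World=1}, {Hello=1}, {Hello=1, World=1}, {Hi=1}]
        System.out.println(reduceCalls(scatteredKeys()));
        // prints [{Hello=3}, {Hi=1}, {World=2}]
        System.out.println(reduceCalls(wordAsKey()));
    }
}
```

With scattered keys, each simulated reduce call sees only a slice of the data, so every count is 1, which matches the sysout in the question. When the word itself is the key, each word arrives in exactly one call with all of its occurrences. In a real job this presumably means having the Mapper write the word as its output key, as the Word Count example does, so that no in-reducer sorting or grouping is needed.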
