hadoop - Hadoop reducer values in memory?

Question

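I'm writing a MapReduce job that may end up with a very large number of values in the reducer. I'm worried that all of these values will be loaded into memory at once.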

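Does the underlying implementation of Iterable<VALUEIN> values load values into memory only as they are needed? Hadoop: The Definitive Guide seems to suggest this is the case, but it doesn't give a "definitive" answer.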

The reducer's output will be much larger than the input values, but I believe the output is written to disk as needed.

score 15 · Accepted Answer

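You're reading the book correctly. The reducer does not hold all of the values in memory. Instead, as you loop through the Iterable of values, each Object instance is reused, so only one instance is kept at any given time.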

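For example, in the following code, the objs ArrayList will have the expected size after the loop, but every element will be the same, because the Text val instance is reused on each iteration.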

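import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class ReducerExample extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) {
        ArrayList<Text> objs = new ArrayList<Text>();
        for (Text val : values) {
            // Stores a reference to the single reused Text instance, not a copy.
            objs.add(val);
        }
        // objs.size() is as expected, but every element is the same object.
    }
}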

(If for some reason you do want to do something further with each val, you should make a deep copy and store that instead.)
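To make the reuse pitfall concrete without needing a Hadoop cluster, here is a minimal plain-Java simulation. `MutableValue`, `ReusingIterable`, and `ReuseDemo` are hypothetical names invented for this sketch; they stand in for `Text` and for the reducer's value iterator and are not Hadoop classes. In a real reducer, the copy step would be `new Text(val)`, as the book excerpt quoted in this answer suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a mutable Hadoop Writable such as Text.
class MutableValue {
    private String content;
    MutableValue(String content) { this.content = content; }
    MutableValue(MutableValue other) { this.content = other.content; } // copy constructor
    void set(String content) { this.content = content; }
    String get() { return content; }
}

// Hypothetical iterable that, like the reducer's value iterator,
// hands back the same instance on every call to next().
class ReusingIterable implements Iterable<MutableValue> {
    private final List<String> raw;
    private final MutableValue shared = new MutableValue("");
    ReusingIterable(List<String> raw) { this.raw = raw; }

    @Override
    public Iterator<MutableValue> iterator() {
        final Iterator<String> it = raw.iterator();
        return new Iterator<MutableValue>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public MutableValue next() {
                shared.set(it.next()); // overwrite the contents, keep the instance
                return shared;
            }
        };
    }
}

class ReuseDemo {
    // Stores references: every list element is the same shared object.
    static List<String> collectByReference(Iterable<MutableValue> values) {
        List<MutableValue> objs = new ArrayList<>();
        for (MutableValue val : values) objs.add(val);
        return contents(objs);
    }

    // Stores copies: each element keeps the contents it had when it was seen.
    static List<String> collectByCopy(Iterable<MutableValue> values) {
        List<MutableValue> objs = new ArrayList<>();
        for (MutableValue val : values) objs.add(new MutableValue(val));
        return contents(objs);
    }

    private static List<String> contents(List<MutableValue> objs) {
        List<String> out = new ArrayList<>();
        for (MutableValue v : objs) out.add(v.get());
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "b", "c");
        System.out.println(collectByReference(new ReusingIterable(raw))); // [c, c, c]
        System.out.println(collectByCopy(new ReusingIterable(raw)));      // [a, b, c]
    }
}
```

Note that copying every value brings back the memory cost the iterator was designed to avoid, so it only makes sense for the few values you really need to retain.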

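Of course, even a single value could be larger than memory. In that case, the developer is advised to pare the data down in the preceding Mapper so that the value is not so large.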

Update: see pages 199-200 of Hadoop: The Definitive Guide, 2nd Edition.

This code snippet makes it clear that the same key and value objects are used on each
invocation of the map() method -- only their contents are changed (by the reader's
next() method). This can be a surprise to users, who might expect keys and values to be
immutable. This causes problems when a reference to a key or value object is retained
outside the map() method, as its value can change without warning. If you need to do
this, make a copy of the object you want to hold on to. For example, for a Text object,
you can use its copy constructor: new Text(value).

The situation is similar with reducers. In this case, the value objects in the reducer's
iterator are reused, so you need to copy any that you need to retain between calls to
the iterator.

score 2 · Accepted Answer

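Not all of it is in memory; some of it comes from disk. Looking at the code, the framework appears to break the Iterable into segments and load them from disk into memory one at a time. The relevant classes are: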

org.apache.hadoop.mapreduce.task.ReduceContextImpl org.apache.hadoop.mapred.BackupStore
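The segment-by-segment idea can be sketched in plain Java. This is only an illustration of lazy, one-segment-at-a-time iteration and is not how ReduceContextImpl or BackupStore are actually written; `SegmentedValues` and `SegmentDemo` are hypothetical names, and each `Supplier` merely simulates reading one segment from disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical iterable: each Supplier simulates loading one segment from disk.
class SegmentedValues implements Iterable<String> {
    private final List<Supplier<List<String>>> segments;
    SegmentedValues(List<Supplier<List<String>>> segments) { this.segments = segments; }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int nextSegment = 0;
            private Iterator<String> current = Collections.emptyIterator();

            @Override public boolean hasNext() {
                // Load the next segment only once the current one is exhausted,
                // so at most one segment is referenced by the iterator at a time.
                while (!current.hasNext() && nextSegment < segments.size()) {
                    current = segments.get(nextSegment++).get().iterator();
                }
                return current.hasNext();
            }

            @Override public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }
}

class SegmentDemo {
    static int loads = 0; // number of segments "read from disk" so far

    static Supplier<List<String>> segment(String... values) {
        return () -> { loads++; return Arrays.asList(values); };
    }

    public static void main(String[] args) {
        List<Supplier<List<String>>> segs = new ArrayList<>();
        segs.add(segment("a", "b"));
        segs.add(segment("c"));
        for (String v : new SegmentedValues(segs)) {
            System.out.println(v + " (segments loaded so far: " + loads + ")");
        }
    }
}
```

The point of the design is that the consumer sees one continuous Iterable while the memory footprint stays bounded by the size of a single segment.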

score 0 · Accepted Answer

As other users have noted, the entire data set is not loaded into memory. Take a look at some of the mapred-site.xml parameters from the Apache documentation.

mapreduce.reduce.merge.inmem.threshold

Default value: 1000. The threshold, expressed as a number of files, for the in-memory merge process.

mapreduce.reduce.shuffle.merge.percent

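Default value: 0.66. The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.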

mapreduce.reduce.shuffle.input.buffer.percent

Default value: 0.70. The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.

mapreduce.reduce.input.buffer.percent

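Default value: 0. The percentage of memory, relative to the maximum heap size, used to retain map outputs during the reduce. When the shuffle has concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.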

mapreduce.reduce.shuffle.memory.limit.percent

Default value: 0.25. The maximum percentage of the in-memory limit that a single shuffle can consume.
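To get a feel for how these percentages interact, here is a rough worked calculation in plain Java. It assumes each threshold is a simple fraction of the quantity named in its description; the class `ShuffleMemory`, its method names, and the 1,000,000,000-byte heap are illustrative assumptions, and the exact computation inside Hadoop may differ between versions.

```java
class ShuffleMemory {
    // Bytes set aside for holding map outputs in memory during the shuffle
    // (mapreduce.reduce.shuffle.input.buffer.percent applied to the max heap).
    static long shuffleBuffer(long maxHeapBytes, double inputBufferPercent) {
        return Math.round(maxHeapBytes * inputBufferPercent);
    }

    // Buffer usage at which an in-memory merge would start
    // (mapreduce.reduce.shuffle.merge.percent applied to the shuffle buffer).
    static long mergeTrigger(long shuffleBufferBytes, double mergePercent) {
        return Math.round(shuffleBufferBytes * mergePercent);
    }

    // Largest share of the buffer a single shuffle may take
    // (mapreduce.reduce.shuffle.memory.limit.percent applied to the shuffle buffer).
    static long singleShuffleLimit(long shuffleBufferBytes, double limitPercent) {
        return Math.round(shuffleBufferBytes * limitPercent);
    }

    public static void main(String[] args) {
        long heap = 1_000_000_000L;                    // assumed reducer max heap
        long buffer = shuffleBuffer(heap, 0.70);       // 700,000,000 bytes
        long trigger = mergeTrigger(buffer, 0.66);     // 462,000,000 bytes
        long single = singleShuffleLimit(buffer, 0.25); // 175,000,000 bytes
        System.out.println("shuffle buffer:       " + buffer);
        System.out.println("merge trigger:        " + trigger);
        System.out.println("single shuffle limit: " + single);
    }
}
```

With the defaults and this assumed heap, roughly 700 MB is available for in-memory map outputs, a merge starts once about 462 MB of it is in use, and no single shuffle may take more than about 175 MB.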
