4

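Could you explain, or point me to a link about, what DISTRIBUTE BY really does in Hive? How does it control which files are sent to a particular reducer?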

4

3 Answers

14

DISTRIBUTE BY controls how map output is divided among reducers. By default, MapReduce computes a hash on the keys output by the mappers and uses the hash values to try to distribute the key-value pairs evenly among the available reducers. Say we want all the rows for each value of a column to be processed together. We can use DISTRIBUTE BY to ensure that the records for each value go to the same reducer. DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls how reducers receive rows for processing, while SORT BY controls how rows are sorted within each reducer. Note that Hive requires the DISTRIBUTE BY clause to come before the SORT BY clause when both appear in the same query.
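The routing described above can be sketched in plain Python. This is only a toy model, not Hive's implementation: `zlib.crc32` stands in for whatever hash the framework really uses, and the `store_id` and `amount` columns, the sample rows, and the helper names are all hypothetical. It roughly mimics what a query shaped like `SELECT ... DISTRIBUTE BY store_id SORT BY store_id, amount` would do with two reducers.

```python
import zlib
from collections import defaultdict


def reducer_for(key: str, num_reducers: int) -> int:
    # Stand-in for hash partitioning: hash the key, then take it modulo
    # the number of reducers. crc32 is an assumption chosen only because
    # it is deterministic across runs; it is not the hash Hive uses.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute_by(rows, column, num_reducers):
    # Rows with the same value in `column` always land in the same bucket.
    buckets = defaultdict(list)
    for row in rows:
        buckets[reducer_for(str(row[column]), num_reducers)].append(row)
    return dict(buckets)


def sort_by(buckets, columns):
    # SORT BY orders rows inside each reducer only, not across reducers.
    return {
        reducer: sorted(assigned, key=lambda row: tuple(row[c] for c in columns))
        for reducer, assigned in buckets.items()
    }


# Hypothetical sample data.
rows = [
    {"store_id": "s1", "amount": 30},
    {"store_id": "s2", "amount": 10},
    {"store_id": "s1", "amount": 5},
    {"store_id": "s3", "amount": 7},
    {"store_id": "s2", "amount": 2},
]

buckets = sort_by(distribute_by(rows, "store_id", 2), ["store_id", "amount"])
for reducer, assigned in sorted(buckets.items()):
    print(reducer, assigned)
```

Whichever reducer a given `store_id` hashes to, all of its rows end up there, and the ordering is only guaranteed within each reducer's output.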

answered 2013-09-23T06:16:44.017
3

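DISTRIBUTE BY is a good workaround for using less memory when you have a memory-intensive job, and it forces Hadoop to use reducers instead of running a map-only job. Essentially, the mappers group rows according to the columns named in DISTRIBUTE BY, which reduces the work the framework has to do overall, and then pass those groups on to the reducers.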

See https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert

answered 2014-04-16T05:54:50.430
0

You can check the official documentation here.

answered 2013-09-23T10:31:09.353