hadoop - How to reduce the number of output files in Apache Hive

Question

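Does anyone know of a tool that can "compact" Apache Hadoop output files into fewer files, or into a single file? At the moment I download all the files to my local machine and concatenate them into one file. Is there an API or a tool that does the same thing? Thanks in advance.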

score 4 · Accepted Answer

Limiting the number of output files means limiting the number of reducers. You can do this in the Hive shell by setting the mapred.reduce.tasks property. Example:

hive>  set mapred.reduce.tasks = 5;
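If you run the query from a script rather than the interactive shell, the same setting can be passed along with the query. A minimal sketch, assuming the hive CLI is on your PATH; run_with_reducers is a hypothetical helper, and the table name and output path in the usage comment are made up. Note that the reducer count only matters for jobs that actually have a reduce phase.

```shell
#!/bin/sh
# Sketch only: run one Hive query with a fixed number of reducers.
# run_with_reducers is a hypothetical helper, not part of Hive.

run_with_reducers() {
    reducers="$1"
    query="$2"
    # Reject anything that is not a plain whole number.
    case "$reducers" in
        ''|*[!0-9]*)
            echo "reducer count must be a whole number" >&2
            return 2
            ;;
    esac
    # hive -e executes the quoted statements and exits, so the setting
    # applies only to this one session.
    hive -e "set mapred.reduce.tasks=${reducers}; ${query}"
}

# Usage (table name and output path are made up):
#   run_with_reducers 5 "INSERT OVERWRITE DIRECTORY '/tmp/dept_counts' SELECT dept, count(*) FROM employees GROUP BY dept"
```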

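However, this may hurt query performance. Alternatively, once the query has finished you can use the getmerge command from the HDFS shell. It takes a source directory and a destination file as input and concatenates the files in src into the local destination file.

Usage: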

bin/hadoop fs -getmerge <src> <localdst>
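To replace the manual download-and-concatenate step from the question, the getmerge call can be wrapped in a small script. This is only a sketch: merge_hive_output is a hypothetical helper, the paths in the usage comment are placeholders, and it assumes the hadoop command is on your PATH.

```shell
#!/bin/sh
# Sketch only: pull all output files of a query from HDFS into one local file.
# merge_hive_output is a hypothetical helper, not part of Hadoop.

merge_hive_output() {
    src_dir="$1"      # HDFS directory holding the query's output files
    local_file="$2"   # local file that receives the concatenated content
    if [ -z "$src_dir" ] || [ -z "$local_file" ]; then
        echo "usage: merge_hive_output <hdfs-src-dir> <local-dst-file>" >&2
        return 2
    fi
    # getmerge concatenates the files under src_dir into local_file.
    hadoop fs -getmerge "$src_dir" "$local_file"
}

# Usage (both paths are placeholders):
#   merge_hive_output /tmp/dept_counts ./dept_counts.txt
```

The merged result lands on the local file system, not in HDFS, so this fits best when the data is small enough to handle on one machine.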

HTH

score 1 · Accepted Answer

See https://community.cloudera.com/t5/Support-Questions/Hive-Multiple-Small-Files/td-p/204038

set hive.merge.mapfiles=true;     -- Merge small files at the end of a map-only job.
set hive.merge.mapredfiles=true;  -- Merge small files at the end of a map-reduce job.

set hive.merge.size.per.task=???; -- Size (bytes) of merged files at the end of the job.

set hive.merge.smallfiles.avgsize=???; -- File size (bytes) threshold
-- When the average output file size of a job is less than this number, 
-- Hive will start an additional map-reduce job to merge the output files 
-- into bigger files. This is only done for map-only jobs if hive.merge.mapfiles 
-- is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
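
One way to apply the settings above from a script is to generate the set statements and prepend them to the query. This is a rough sketch: merge_settings is a hypothetical helper, and the byte values in the usage comment are illustrative only, not recommendations; pick sizes that suit your cluster's block size and data volume.

```shell
#!/bin/sh
# Sketch only: emit the Hive merge settings with caller-supplied sizes.
# merge_settings is a hypothetical helper, not part of Hive.

merge_settings() {
    size_per_task="$1"   # target size (bytes) of each merged file
    avg_size="$2"        # average-size threshold (bytes) that triggers a merge
    printf 'set hive.merge.mapfiles=true;\n'
    printf 'set hive.merge.mapredfiles=true;\n'
    printf 'set hive.merge.size.per.task=%s;\n' "$size_per_task"
    printf 'set hive.merge.smallfiles.avgsize=%s;\n' "$avg_size"
}

# Usage (sizes, table name and output path are illustrative):
#   hive -e "$(merge_settings 256000000 16000000) INSERT OVERWRITE DIRECTORY '/tmp/dept_counts' SELECT dept, count(*) FROM employees GROUP BY dept"
```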
