amazon-web-services - Hive Compression ORC in Snappy

Question

Using: Amazon AWS Hive (0.13)
Trying to: output ORC files with Snappy compression.

create external table output(
col1 string)
partitioned by (col2 string)
stored as orc
location 's3://mybucket'
tblproperties("orc.compress"="SNAPPY");

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.output = true;    
set mapred.output.compression.type = BLOCK;  
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

insert into table output
partition(col2)
select col1,col2 from input;

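The problem is that when I look at the output in the mybucket directory, the files do not have a SNAPPY extension. They are binary files, though. What setting am I missing to get these ORC files compressed and written out with a SNAPPY extension?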

score 3 · Accepted Answer

OrcFiles are binary files in a specialized format. When you specify orc.compress = SNAPPY, the contents of the file are compressed using Snappy. ORC is a semi-columnar file format.

Take a look at this documentation for more information about how the data is laid out.

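Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.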

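In short, your files are compressed using the Snappy codec. You just can't tell from the outside, because it is the blocks inside the file that are compressed, not the file as a whole.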
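The block-wise idea described in the quoted documentation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ORC's real on-disk layout: zlib stands in for Snappy because it ships with Python, and the 4-byte length prefix and the names write_stream and read_block are hypothetical.

```python
import struct
import zlib


def write_stream(chunks):
    """Compress each chunk independently and prefix it with its length."""
    out = bytearray()
    for chunk in chunks:
        block = zlib.compress(chunk)
        out += struct.pack("<I", len(block))  # hypothetical 4-byte length header
        out += block
    return bytes(out)


def read_block(stream, index):
    """Decompress only block `index`, jumping over the blocks before it."""
    pos = 0
    for _ in range(index):
        (length,) = struct.unpack_from("<I", stream, pos)
        pos += 4 + length  # skip the block without decompressing it
    (length,) = struct.unpack_from("<I", stream, pos)
    return zlib.decompress(stream[pos + 4 : pos + 4 + length])


chunks = [b"first block " * 100, b"second block " * 100, b"third block " * 100]
stream = write_stream(chunks)
third = read_block(stream, 2)
```

The resulting stream is an opaque binary blob with no outward sign of the codec, which matches what the question observed in the bucket.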

score 3 · Accepted Answer

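Additionally, you can use hive --orcfiledump /apps/hive/warehouse/orc/000000_0 to see the details of your file. The Compression line names the codec: the sample below comes from a ZLIB-compressed file, and a file written with orc.compress = SNAPPY would show SNAPPY there. The output will look like this: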

Reading ORC rows from /apps/hive/warehouse/orc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:int>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 6
    Column 1: count: 6 min: Beth max: Owen sum: 29
    Column 2: count: 6 min: 1 max: 6 sum: 21

File Statistics:
  Column 0: count: 6
  Column 1: count: 6 min: Beth max: Owen sum: 29
  Column 2: count: 6 min: 1 max: 6 sum: 21
....
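
If you want to check the codec from a script rather than by eye, a minimal Python sketch along these lines could pull it out of the dump text. The function name orc_compression is made up for this example, and it assumes the dump contains a "Compression:" line formatted as in the sample output above.

```python
import re


def orc_compression(dump_text):
    """Return the codec named on the 'Compression:' line, or None if absent."""
    # The colon directly after 'Compression' excludes the 'Compression size:' line.
    match = re.search(r"^Compression:\s*(\S+)", dump_text, flags=re.MULTILINE)
    return match.group(1) if match else None


sample = """Rows: 6
Compression: ZLIB
Compression size: 262144
"""
codec = orc_compression(sample)
```

In practice the text would come from capturing the output of the orcfiledump command, for example with subprocess.run.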
