I export my DynamoDB tables to S3 as a backup (via EMR). When I export, I store the data as LZO compressed files. My Hive query is below, but essentially I followed "To export an Amazon DynamoDB table to an Amazon S3 bucket using data compression" at http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html
I now want to do the reverse - take my LZO files and get them back into a Hive table. How do you do that? I was expecting to see some Hive configuration property for input, but there isn't one. I've googled and found some hints, but nothing definitive and nothing that works.
The files in S3 are of the format: s3://[mybucket]/backup/year=2012/month=08/day=01/000000.lzo
Here is my HQL that does the export:
SET dynamodb.throughput.read.percent=1.0;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;
CREATE EXTERNAL TABLE hiveSBackup (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "${DYNAMOTABLENAME}",
"dynamodb.column.mapping" = "id:id,periodStart:periodStart,allotted:allotted,remaining:remaining,created:created,seconds:seconds,served:served,modified:modified");
CREATE EXTERNAL TABLE s3_export (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<mybucket>/backup';
INSERT OVERWRITE TABLE s3_export
PARTITION (year="${PARTITIONYEAR}", month="${PARTITIONMONTH}", day="${PARTITIONDAY}")
SELECT * from hiveSBackup;
Any ideas how to get this back out of S3, decompressed, and into a Hive table?
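For what it's worth, my current understanding is that no input-side property should be needed: as long as the LZO codec is registered in `io.compression.codecs` on the cluster, Hive decompresses `.lzo` text files transparently when reading them. So a sketch of the reverse direction would be to define an external table over the backup location, register the partition, and insert back into the DynamoDB-backed table (table name `s3_import` and the partition values here are illustrative, matching the backup layout above):

```sql
-- Assumes com.hadoop.compression.lzo.LzopCodec is installed and listed in
-- io.compression.codecs, so Hive can read the .lzo files directly.
CREATE EXTERNAL TABLE s3_import (id bigint, periodStart string, allotted bigint,
    remaining bigint, created string, seconds bigint, served bigint, modified string)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<mybucket>/backup';

-- Register the partition that matches the S3 path layout (year=/month=/day=).
ALTER TABLE s3_import ADD PARTITION (year='2012', month='08', day='01');

-- Write back into the DynamoDB-backed table, selecting only the data columns
-- (not the partition columns).
INSERT OVERWRITE TABLE hiveSBackup
SELECT id, periodStart, allotted, remaining, created, seconds, served, modified
FROM s3_import;
```

But I haven't been able to confirm whether this is the intended approach, which is why I'm asking.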