I run Hive on AWS EMR and have a job flow that frequently parses log data to S3. I use dynamic partitioning (date and log level) for the parsed Hive table.
One thing that now takes forever, once I have several GB of data and many partitions, is Hive loading the data into the table after the parsing is finished:
Loading data to table default.logs partition (dt=null, level=null)
...
Loading partition {dt=2013-08-06, level=INFO}
Loading partition {dt=2013-03-12, level=ERROR}
Loading partition {dt=2013-08-03, level=WARN}
Loading partition {dt=2013-07-08, level=INFO}
Loading partition {dt=2013-08-03, level=ERROR}
...
Partition default.logs{dt=2013-03-05, level=INFO} stats: [num_files: 1, num_rows: 0, total_size: 1905, raw_data_size: 0]
Partition default.logs{dt=2013-03-06, level=ERROR} stats: [num_files: 1, num_rows: 0, total_size: 4338, raw_data_size: 0]
Partition default.logs{dt=2013-03-06, level=INFO} stats: [num_files: 1, num_rows: 0, total_size: 828250, raw_data_size: 0]
...
Partition default.logs{dt=2013-08-14, level=INFO} stats: [num_files: 5, num_rows: 0, total_size: 626629, raw_data_size: 0]
Partition default.logs{dt=2013-08-14, level=WARN} stats: [num_files: 4, num_rows: 0, total_size: 4405, raw_data_size: 0]
Is there a way to overcome this issue and bring down the loading time of this step?
I have already tried archiving old logs to Glacier through a bucket lifecycle rule, hoping that Hive would then skip loading the archived partitions. Since this still leaves the files (paths) visible in S3, Hive recognizes the archived partitions anyway, so no performance is gained.
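As far as I can tell, the only way to make Hive actually skip them would be to drop the archived partitions from the metastore myself (for an external table this removes only the metadata, not the S3 data), e.g.:
-- Illustration only; the partition spec values are placeholders
-- taken from the log excerpt above.
ALTER TABLE logs DROP PARTITION (dt='2013-03-05', level='INFO') ;
But that would make the archived data unqueryable through the table, which defeats the purpose of archiving instead of deleting.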
Update 1
The data is loaded by simply inserting it into the dynamically partitioned table:
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, exception, value, server, app, version, dt, level
FROM new_logs ;
from a table containing the unparsed logs:
CREATE EXTERNAL TABLE new_logs (
  dt STRING,
  time STRING,
  thread STRING,
  level STRING,
  logger STRING,
  identity STRING,
  message STRING,
  logtype STRING,
  logsubtype STRING,
  node STRING,
  storageallocationstatus STRING,
  nodelist STRING,
  userid STRING,
  nodeid STRING,
  path STRING,
  datablockid STRING,
  hash STRING,
  size STRING,
  value STRING,
  exception STRING,
  version STRING
)
PARTITIONED BY (
  server STRING,
  app STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT 'org.maz.hadoop.mapred.LogFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://my-log/logs/${LOCATION}' ;
into the new (parsed) table:
CREATE EXTERNAL TABLE logs (
  time STRING,
  thread STRING,
  logger STRING,
  identity STRING,
  message STRING,
  logtype STRING,
  logsubtype STRING,
  node STRING,
  storageallocationstatus STRING,
  nodelist STRING,
  userid STRING,
  nodeid STRING,
  path STRING,
  datablockid STRING,
  hash STRING,
  size STRING,
  exception STRING,
  value STRING,
  server STRING,
  app STRING,
  version STRING
)
PARTITIONED BY (
  dt STRING,
  level STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://my-log/parsed-logs' ;
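For completeness: both dt and level are dynamic in the INSERT above, so it needs the usual dynamic-partitioning session settings; the limit values here are illustrative rather than tuned:
-- Required for dynamic partition inserts; nonstrict allows every
-- partition column (here dt AND level) to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Caps that may need raising when one insert creates many partitions
-- (illustrative values).
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=1000;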
The input format (LogFileInputFormat) is responsible for parsing the log entries into the desired log format.
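Both tables are external and live on S3, so partitions must be registered in the metastore before Hive sees them. For logs this happens implicitly during the INSERT itself, which is exactly the loading step that is slow; for new_logs it happens beforehand, along these lines (a sketch; server, app and the path are placeholders, as the real paths depend on ${LOCATION}):
-- Register one partition of unparsed logs (placeholder values).
ALTER TABLE new_logs ADD PARTITION (server='server1', app='app1')
LOCATION 's3://my-log/logs/server1/app1' ;
-- Bulk alternative on EMR Hive (stock Apache Hive: MSCK REPAIR TABLE new_logs ;):
ALTER TABLE new_logs RECOVER PARTITIONS ;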
Update 2
When I try the following:
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, exception, value, server, app, version, dt, level
FROM new_logs
WHERE dt > 'some old date';
Hive still loads all partitions of logs. If, on the other hand, I use static partitioning, like:
INSERT INTO TABLE logs PARTITION (dt='some date', level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, exception, value, server, app, version, level
FROM new_logs
WHERE dt = 'some date';
Hive only loads the relevant partitions, but then I would need to create one query for every date I think might occur in new_logs. Usually new_logs contains only today's and yesterday's log entries, but it may contain older ones as well.
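To at least avoid guessing the dates, I can first list the ones that actually occur and then generate one static-partition insert per returned date:
-- Find the dates present in the unparsed logs (typically two rows).
SELECT DISTINCT dt FROM new_logs ;
But that still means one full insert query per date.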
Static partitioning is the solution I have chosen for now, but isn't there any other (better) solution to my problem?