I have inconsistent log files that I want to partition in Hive using dynamic partitioning. Sample file:
20/06/13 20:21:42.637 FLW CPTView::OnInitialUpdate nRemoveAppShareQSize0=50000\n
20/06/13 20:21:42.638 FLW \n
BandwidthGlobalSettings:Old Bandwidth common defined\n
Sometimes the log file contains lines that start with a word rather than a date. Lines are separated by \n.
I am running these commands:
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_temp (date STRING,time STRING,severity STRING,message STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/tmp';
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_partitioned (time STRING,severity STRING,message STRING) PARTITIONED BY (date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/partitions';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM log_messages_temp pvs INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date) SELECT pvs.time, pvs.severity, pvs.message, pvs.date;
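As a result, two dynamic partitions are created: date=20/06/13 and date=BandwidthGlobalSettings:Old
I want Hive to ignore lines that start with a non-date string.
How can I do this? Or is there another solution? Thanks.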
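
For illustration, one possible direction is to filter rows during the insert instead of changing how the staging table is read. The sketch below assumes the two tables and the dynamic-partition settings from the commands above, and assumes every valid line starts with a date in the form yy/MM/dd, as in the sample. The regular expression is an assumption based on that sample, not a confirmed format of the real logs, and the statement uses date unquoted as a column name only because the original commands do.

```sql
-- Sketch only: keep rows whose first field looks like a yy/MM/dd date.
-- Assumes log_messages_temp and log_messages_partitioned already exist
-- exactly as created in the commands above.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

FROM log_messages_temp pvs
INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date)
SELECT pvs.time, pvs.severity, pvs.message, pvs.date
-- RLIKE takes a Java regular expression; [0-9] avoids backslash escaping.
WHERE pvs.date RLIKE '^[0-9]{2}/[0-9]{2}/[0-9]{2}$';
```

With this filter, a line such as "BandwidthGlobalSettings:Old Bandwidth common defined" would not reach the partitioned table, so no partition would be created for it. Continuation lines are dropped rather than merged into the preceding message.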