json - Hive CREATE TABLE statement for Fluentd Apache log data

Question

I am using Fluentd to capture and consolidate Apache log data in HDFS. I configured the agent to write the data to HDFS; the /etc/td-agent/td-agent.conf file contains:

<source>
  type tail
  path /var/log/httpd/access_log
  pos_file /var/log/td-agent/httpd-access.log.pos
  tag apache.access
  format apache2
</source>

<match apache.access>
  type webhdfs
  host fqdn.of.name.node
  port 50070
  path /data/access_logs/access.log.%Y%m%d_%H.${hostname}.log
  flush_interval 10s
</match>

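I enabled HDFS append as described in the Fluentd documentation. The data flows through perfectly: over the past few weeks it has handled millions of transactions without a failure.

The data is stored in files containing lines like this: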

2015-01-10T17:00:00Z    apache.access   {"host":"155.96.21.4","user":null,"method":"GET","path":"/somepage/index.html","code":200,"size":8192,"referer":null,"agent":"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET4.0C; .NET4.0E)"}

Each line contains three tab-delimited elements:

timestamp
identifying tag
JSON containing key/value pairs for the columns in the Apache log
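
As a rough illustration of that layout, the following Python sketch splits one such line on its first two tabs and decodes the remaining JSON. The function name parse_fluentd_line and the shortened sample record are made up for this example; they are not part of Fluentd or Hive.

```python
import json


def parse_fluentd_line(line):
    """Split one time<TAB>tag<TAB>json line into its three parts.

    Hypothetical helper for illustration only.
    """
    # maxsplit=2 ensures any tab characters inside the JSON payload are left alone
    timestamp, tag, payload = line.rstrip("\n").split("\t", 2)
    return timestamp, tag, json.loads(payload)


# Shortened version of the sample record, with real tab separators
sample = (
    '2015-01-10T17:00:00Z\tapache.access\t'
    '{"host":"155.96.21.4","user":null,"method":"GET",'
    '"path":"/somepage/index.html","code":200,"size":8192}'
)

timestamp, tag, record = parse_fluentd_line(sample)
print(timestamp, tag, record["host"], record["code"])
```

The first two fields are plain strings, and only the third needs a JSON parser, which is the mix the table definition has to account for.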

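I am trying to create a Hive table, but I am not sure how to handle the fact that each line is a mix of tab-delimited strings and JSON. I know Hive has a JSON deserializer, but I don't think it will work because the records are not pure JSON.

Does anyone have a suggestion for how to write a CREATE TABLE statement for this data?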

score 0 · Accepted Answer

Try adding the following parameter to your out_webhdfs configuration:

output_data_type json

This ought to be documented. I will update the docs soon.
