I'm looking for a regular expression that can be passed to a Hive QL CREATE EXTERNAL TABLE statement in the following form:
"input.regex"="the regex goes here"
The constraint is that the log files RegexSerDe has to read have the following format:
2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
This is a special one.
This has a message that is multi-lined.
This is line number 4 of the same log.
Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.
I'm using the following CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';
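Here's the thing:
It pulls the first line of every log entry, but not the remaining lines of entries that span more than one line. I tried the suggestions from every link I found: replacing the \z at the end with \Z, and replacing \A with ^ and \Z or \z with $. Nothing worked. Am I missing something about the %4$s in output.format.string? Or am I not using the regex properly?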
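To make the symptom concrete outside Hive, here is a minimal sketch. It rests on an assumption I have not confirmed for my setup: that with STORED AS TEXTFILE each physical line reaches the SerDe as its own record. The pattern is a simplified stand-in written for Python's `re` module (Hive uses Java regular expressions), and `LINE_PATTERN`, `SAMPLE`, and `parse_per_line` are hypothetical names used only for this illustration.

```python
import re

# Simplified stand-in for the pattern in the question, written for Python's
# `re` module. Hive uses java.util.regex, so this is only an illustration.
LINE_PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "  # timestamp
    r"\[([A-Z]+)\] "                                  # log type
    r"([0-9a-z-]+) "                                  # request id
    r"(.*)",                                          # message
    re.DOTALL,
)

# Shortened excerpt of the log format shown in the question.
SAMPLE = (
    "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message.\n"
    "This is a special one.\n"
    "Line 5.\n"
    "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n"
)


def parse_per_line(text):
    """Apply the pattern to each physical line separately (assumed reader behavior)."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.fullmatch(line)
        # None stands in for a row whose columns could not be extracted.
        rows.append(match.groups() if match else None)
    return rows


rows = parse_per_line(SAMPLE)
for row in rows:
    print(row)
```

Under that assumption, only lines that start with a timestamp produce columns, and the continuation lines never get the chance to be joined to the entry they belong to, whatever anchors the regex uses.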
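What the regex does:
It first matches the timestamp, then the log type (DEBUG, INFO, or whatever else), then the ID (a mix of lowercase letters, digits, and hyphens), and then anything at all until either the next timestamp is found or the end of the input is reached, the latter so that the last log entry also matches. I also tried adding /m at the end, in which case the generated table contains only NULL values.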
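The behavior I am after can be sketched as follows, assuming the whole file is available to the regex as a single string. That assumption may well not hold inside Hive. The pattern is again a simplified Python `re` rewrite rather than the exact Java regex from my table definition, and `TIMESTAMP`, `ENTRY_PATTERN`, and `SAMPLE` are hypothetical names for this illustration.

```python
import re

TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}"

# One log entry: timestamp, [TYPE], request id, then a lazy message that stops
# right before the next line starting with a timestamp, or at end of input.
ENTRY_PATTERN = re.compile(
    "^(" + TIMESTAMP + ") "
    r"\[([A-Z]+)\] "
    r"([0-9a-z-]+) "
    r"(.*?)"
    r"(?=\n" + TIMESTAMP + r" |\n?\Z)",
    re.DOTALL | re.MULTILINE,  # '.' spans newlines, '^' matches at every line start
)

# Shortened excerpt of the log format shown in the question.
SAMPLE = (
    "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message.\n"
    "This is a special one.\n"
    "Line 5.\n"
    "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n"
)

# Each tuple is (logdatetime, logtype, requestid, verbosedata).
entries = [match.groups() for match in ENTRY_PATTERN.finditer(SAMPLE)]
for entry in entries:
    print(entry)
```

Here the multi-line ERROR entry comes back as a single record whose message keeps its embedded line breaks, which is the result I would like the verbosedata column to hold.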