regex - Hive RegexSerDe multi-line log matching

Question

I am looking for a regular expression that can be supplied to a Hive QL CREATE EXTERNAL TABLE statement in the following form:

"input.regex"="the regex goes here"

The log files that RegexSerDe has to read are formatted as follows:

2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
 This is a special one.
 This has a message that is multi-lined.
 This is line number 4 of the same log.
 Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.

I am using the following CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';

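Here is what happens:

It extracts the first line of every log entry, but not the remaining lines of multi-line entries. I tried every combination: replacing \z with \Z at the end, and replacing \A with ^ and \Z or \z with $. Nothing worked. Am I missing something for %4$s in output.format.string, or am I not using the regex correctly?

What the regex does:

It first matches the timestamp, then the log type (DEBUG, INFO, or whatever), then the ID (a mix of lowercase letters, digits, and hyphens), and then anything at all until it finds the next timestamp or reaches the end of input, which matches the last log entry. I also tried adding /m at the end, in which case the resulting table contains only NULL values.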

score 1 · Accepted Answer

Your regex appears to have a number of problems.

First, remove the double square brackets.

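Second, \A and \Z/\z match the start and end of the input, not just of a line. Change \A to ^ to match the start of a line (in Java, ^ matches at line starts only in multiline mode), but do not change \z to $, because in this case you really do want to match the end of input.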
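To make the anchor behaviour concrete, here is a minimal Java sketch. The class name AnchorDemo and the sample text are made up for illustration. It counts matches for \A, for ^ in the default mode, and for ^ with Pattern.MULTILINE:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorDemo {
    // Counts how many times the pattern is found in the input.
    static int count(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical three-line input; every line starts with "2013".
        String input = "2013 first\n2013 second\n2013 third";

        // \A only ever matches at the very start of the input.
        System.out.println(count(Pattern.compile("\\A2013"), input));

        // Without MULTILINE, ^ behaves like \A.
        System.out.println(count(Pattern.compile("^2013"), input));

        // With MULTILINE, ^ also matches right after each line terminator.
        System.out.println(count(Pattern.compile("^2013", Pattern.MULTILINE), input));
    }
}
```

Only the MULTILINE variant finds all three lines. The other two find just the one at the start of the input.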

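Third, you want (.*?), not (.*)?. The first pattern is non-greedy, whereas the second is greedy but optional. The greedy one would match your entire input through to the end, because you allow it to be followed by end of input.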
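The difference is easy to check in isolation. The following is a small sketch, not taken from the question: the class name GreedyDemo, the shortened lookahead (\n2013 standing in for "next timestamp") and the two-record input are assumptions for illustration, and DOTALL is switched on so that . can cross line breaks in both patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    // DOTALL is set so that '.' may cross line breaks in both patterns.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: two tiny "records", the second starting with 2013.
        String input = "msg A\n2013 msg B";

        // Greedy but optional: swallows everything, then \z satisfies the lookahead.
        System.out.println(firstGroup("(.*)?(?=\\n2013|\\z)", input));

        // Lazy: stops at the first position where the lookahead succeeds.
        System.out.println(firstGroup("(.*?)(?=\\n2013|\\z)", input));
    }
}
```

The greedy pattern captures both records in one group, while the lazy one captures only "msg A".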

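Fourth, . does not match newlines by default. You can use any pair of complementary classes to match every character instead, such as (\s|\S) or ([x]|[^x]), and so on.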
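A quick way to convince yourself, sketched in Java with a made-up class name (DotDemo) and a toy input: the plain dot fails across a newline, while the DOTALL flag or a complementary class such as [\s\S] succeeds.

```java
import java.util.regex.Pattern;

public class DotDemo {
    // True if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "a\nb";

        // Default mode: '.' refuses to match the line terminator.
        System.out.println(matches("a.b", 0, input));

        // DOTALL mode: '.' matches any character, line terminators included.
        System.out.println(matches("a.b", Pattern.DOTALL, input));

        // Complementary classes match any character regardless of flags.
        System.out.println(matches("a[\\s\\S]b", 0, input));
    }
}
```

Which of these applies inside Hive depends on the flags the SerDe compiles the pattern with, so treat this as a check of plain java.util.regex behaviour only.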

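Fifth, if you get single-line matches with \A and \Z/\z, then the input itself is a single line, because you anchored the whole string.

I suggest trying to match just \n. If there is no match, the input does not include newlines.

You cannot add /m to the end, because the regex has no delimiters. It would try to match the literal characters /m, which is why you got no matches.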

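If it is going to work at all, the regex you want is:

"^([0-9: -]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) ([\\s\\S]*?)(?=\\r?\\n[0-9: -]{19},[0-9]|\\r?\\Z)"

Breakdown:

^([0-9: -]{19},[0-9]{3})

Matches the start of a line, then 19 characters that are digits, :, - or spaces, followed by a comma, three digits, and a space. Captures everything except the final space (the timestamp).

(\\[[A-Z]*\\])

Matches a literal [, any number of uppercase letters (possibly none), a literal ], and a space. Captures everything except the final space (the log level).

([0-9a-z-]*)

Matches any number of digits, lowercase letters, or -, and a space. Captures everything except the final space (the message ID).

([\\s\\S]*?)(?=\\r?\\n[0-9: -]{19},[0-9]|\\r?\\Z)

Matches any whitespace or non-whitespace character (that is, any character), but non-greedily because of *?. Matching stops when a new record or the end of input (\Z) follows. Here you do not want to stop at the end of a line, or you would get only one line in the output. Captures everything except the tail (the message text). The \r?\n skips the final newline at the end of the message, as does \r?\Z. You could also write \r?\n\z. Note: uppercase \Z also matches just before a final newline at the end of input, if there is one. Lowercase \z matches only at the very end of input, not before a trailing newline. I added \r? in case you have to deal with Windows line endings, but I don't think it is necessary.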
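For anyone who wants to try this outside Hive, here is a minimal Java sketch under a few assumptions of mine: the whole log text is available as one string, the pattern is compiled with Pattern.MULTILINE so that ^ matches at each line start, and the timestamp class is written as [0-9: -] with the hyphen last so that Java reads it as a literal. The class name MultiLineLogDemo and the shortened request IDs in the sample are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiLineLogDemo {
    // Assumption: MULTILINE is enabled so that ^ matches at every line start.
    static final Pattern RECORD = Pattern.compile(
        "^([0-9: -]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) ([\\s\\S]*?)"
        + "(?=\\r?\\n[0-9: -]{19},[0-9]|\\r?\\Z)",
        Pattern.MULTILINE);

    // Hypothetical sample: one three-line record followed by a one-line record.
    static final String SAMPLE =
        "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g Some message.\n"
        + " This is a special one.\n"
        + " Line 3.\n"
        + "2013-02-12 12:03:24,988 [INFO] 2632323e-432g Another 1-line log\n";

    // Returns one String[4] per record: timestamp, level, id, message.
    static List<String[]> parse(String input) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            records.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4)});
        }
        return records;
    }

    public static void main(String[] args) {
        for (String[] r : parse(SAMPLE)) {
            System.out.println(r[0] + " | " + r[1] + " | " + r[2] + " | " + r[3]);
        }
    }
}
```

On this sample the sketch yields two records, and the first record's message keeps its two continuation lines. It says nothing about whether RegexSerDe ever sees more than one line at a time.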

However, I suspect this will not work either unless you can feed in the whole file at once rather than line by line.
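To illustrate that suspicion, here is a stand-alone Java sketch of what line-by-line input implies. It is not RegexSerDe's code: the class name LineByLineDemo, the single-line pattern and the sample text are assumptions for illustration, and each line is required to match the pattern in full.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineByLineDemo {
    // Single-line version of the record pattern; it must cover the whole line.
    static final Pattern LINE = Pattern.compile(
        "([0-9: -]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)");

    // Returns the four captured fields, or null when the line is not a record start.
    static String[] parseLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        // Hypothetical sample with one continuation line.
        String input =
            "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g Some message.\n"
            + " This is a special one.\n"
            + "2013-02-12 12:03:24,988 [INFO] 2632323e-432g Another 1-line log\n";

        // Each line is matched on its own, so a lookahead can never see the next line.
        for (String line : input.split("\n")) {
            String[] fields = parseLine(line);
            System.out.println(fields == null ? "no match: " + line : "record: " + fields[3]);
        }
    }
}
```

In this setup the continuation line never reaches the pattern together with its first line, so no amount of anchoring or lookahead in the regex can join them.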

Another simple test you could try is:

"^([\\s\\S]+)^\\d"

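If it works, it will match any number of whole lines followed by a digit at the start of the next line (the first digit of the timestamp).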

score 1 · Accepted Answer

The following Java regex may help:

(\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})\s+(\[.+?\])\s+(.+?)\s+([\s\S]+?)(?=\d{4}-\d{1,2}-\d{1,2}|\Z)

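Breakdown:

First capture group: (\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})
Second capture group: (\[.+?\])
Third capture group: (.+?)
Fourth capture group: ([\s\S]+?)

(?=\d{4}-\d{1,2}-\d{1,2}|\Z) is a positive lookahead: it asserts that the regex inside it can be matched at this point. First alternative: \d{4}-\d{1,2}-\d{1,2}. Second alternative: \Z, which asserts the position at the end of the string.

Reference: http://regex101.com/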

score 0 · Accepted Answer

I don't know much about Hive, but the following regex, or a variant of it escaped for a Java string, might work:

(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) \[([a-zA-Z_-]+)\] ([\w-]+) ((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*)

You can see it matching your sample data here:

http://rubular.com/r/tQp9iBp4JI

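Breakdown:

(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) date and time (capture group 1)
\[([a-zA-Z_-]+)\] log level (capture group 2)
([\w-]+) request ID (capture group 3)
((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*) potentially multi-line message (capture group 4)

The first three capture groups are straightforward.

The last one may look a bit odd, but it works on rubular. Breakdown: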

(                       Capture it as one group
    (?:[^\n\r]+)        Match to the end of the line, don't capture
    (?:                 Match line by line, after the first, but don't capture
        [\n\r]{1,2}     Match the new-line
        \s              Only lines starting with a space (this prevents new log-entries from matching)
        [^\n\r]+        Match to the end of the line            
    )*                  Match zero or more of these extra lines
)
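Here is a hedged Java sketch of the same pattern, for anyone who wants to check it outside rubular. The constructs it uses also exist in java.util.regex. The class name ContinuationLineDemo, the shortened request IDs and the assumption that the whole text is searched as one string are mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContinuationLineDemo {
    // The pattern from this answer, escaped for a Java string literal.
    static final Pattern RECORD = Pattern.compile(
        "(\\d{4}-\\d\\d-\\d\\d \\d\\d:\\d\\d:\\d\\d,\\d+) \\[([a-zA-Z_-]+)\\] ([\\w-]+) "
        + "((?:[^\\n\\r]+)(?:[\\n\\r]{1,2}\\s[^\\n\\r]+)*)");

    // Hypothetical sample: one three-line record followed by a one-line record.
    static final String SAMPLE =
        "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g Some message.\n"
        + " This is a special one.\n"
        + " Line 3.\n"
        + "2013-02-12 12:03:24,988 [INFO] 2632323e-432g Another 1-line log\n";

    // Returns one String[4] per record: timestamp, level, id, message.
    static List<String[]> parse(String input) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            records.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4)});
        }
        return records;
    }

    public static void main(String[] args) {
        for (String[] r : parse(SAMPLE)) {
            System.out.println(r[1] + " | " + r[2] + " | " + r[3]);
        }
    }
}
```

Because continuation lines must begin with whitespace, the message group stops right before the next timestamp without needing a lookahead. Note that here the brackets sit outside capture group 2, so the level comes back as ERROR rather than [ERROR].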

I used [^\n\r] rather than . because it looks as though RegexSerDe lets . match newlines (link):

// Excerpt from https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java#L101
if (inputRegex != null) {
  inputPattern = Pattern.compile(inputRegex, Pattern.DOTALL
      + (inputRegexIgnoreCase ? Pattern.CASE_INSENSITIVE : 0));
} else {
  inputPattern = null;
}

Hope this helps.
