regex - Regex pattern to parse HttpLog format

Question

I am looking for a regex pattern matcher for a String in HttpLogFormat. The log is generated by haproxy. Below is a sample String in this format.

Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"

An explanation of the format is available at HttpLogFormat. Any help is appreciated.

I am trying to get the individual pieces of information included in that line. Here are the fields:

process_name '[' pid ']:'
client_ip ':' client_port
'[' accept_date ']'
frontend_name
backend_name '/' server_name
Tq '/' Tw '/' Tc '/' Tr '/' Tt*
status_code
bytes_read
captured_request_cookie
captured_response_cookie
termination_state
actconn '/' feconn '/' beconn '/' srv_conn '/' retries
srv_queue '/' backend_queue
'{' captured_request_headers* '}'
'{' captured_response_headers* '}'
'"' http_request '"'

score 4 · Accepted Answer

Regex:

^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) (\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) \{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$

Result:

Group 1:    Feb 6 12:14:14
Group 2:    localhost
Group 3:    haproxy
Group 4:    14389
Group 5:    10.0.1.2
Group 6:    33317
Group 7:    06/Feb/2009:12:14:14.655
Group 8:    http-in
Group 9:    static
Group 10:   srv1
Group 11:   10/0/30/69/109
Group 12:   200
Group 13:   2750
Group 14:   -
Group 15:   -
Group 16:   ----
Group 17:   1/1/1/1/0
Group 18:   0/0
Group 19:   1wt.eu
Group 20:   
Group 21:   GET
Group 22:   /index.html
Group 23:   HTTP/1.1
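
For readers who want to try the pattern outside a regex tool, here is a minimal Python sketch. Python is an assumption on my part, since the question does not name a language, and the names in FIELD_NAMES and the parse_line helper are hypothetical labels for the 23 groups listed above. The pattern itself is copied unchanged. Note that it expects a single space in the syslog date; syslog output often pads single-digit days with an extra space, so this is worth checking against real logs.

```python
import re

# Pattern copied from the answer above, split over three lines for readability.
HAPROXY_HTTP_LOG = re.compile(
    r'^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) '
    r'(\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) '
    r'\{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$'
)

# Hypothetical labels for groups 1..23, in order.
FIELD_NAMES = [
    "syslog_time", "host", "process_name", "pid", "client_ip", "client_port",
    "accept_date", "frontend_name", "backend_name", "server_name", "timers",
    "status_code", "bytes_read", "captured_request_cookie",
    "captured_response_cookie", "termination_state", "connections", "queues",
    "captured_request_headers", "captured_response_headers",
    "method", "uri", "http_version",
]

def parse_line(line):
    """Return a dict of field name -> string, or None if the line does not match."""
    m = HAPROXY_HTTP_LOG.match(line)
    if m is None:
        return None
    return dict(zip(FIELD_NAMES, m.groups()))

SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
          '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"')

fields = parse_line(SAMPLE)
print(fields["status_code"], fields["uri"])  # 200 /index.html
```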

I use RegexBuddy to write complex regular expressions.

score 2 · Accepted Answer

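Use at your own risk.

This assumes that every field returns something, except the ones you marked with an asterisk (is that what the asterisk means?). There are also obvious failure cases, such as nested brackets of any kind, but if the logger prints reasonably sane messages, I guess you'll be okay...

Of course, even I personally wouldn't want to maintain this, but there you have it. You might want to consider writing a regular ol' parser for this instead, if you can.

Edit: Marked this as CW, since it is more of an "I wonder how this would turn out" answer than anything else. For quick reference, this is what I ended up building with Rubular: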

^[^[]+\s+(\w+)\[(\d+)\]:([^:]+):(\d+)\s+\[([^\]]+)\]\s+[^\s]+\s+(\w+)\/(\w+)\s+(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d*)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d+)\s+(\d+)\/(\d+)\s+\{([^}]*)\}\s\{([^}]*)\}\s+\"([^"]+)\"$

My first programming language was Perl, and even I am willing to admit that this scares me.
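
One way to ease the maintenance worry is to spell the same pattern out with named groups in verbose mode. The sketch below is in Python rather than the Ruby flavor Rubular uses, and it is not the answer's exact pattern: the group names are borrowed from the question's field list, and it differs in two places that I am assuming are wanted (it captures frontend_name, which the pattern above skips, and it keeps the space after "]:" out of the client IP). The sub-patterns are otherwise unchanged, so the same failure cases apply.

```python
import re

LOG_LINE = re.compile(r'''
    ^[^\[]+\s+                                   # syslog timestamp and host (skipped)
    (?P<process_name>\w+)\[(?P<pid>\d+)\]:\s*
    (?P<client_ip>[^:\s]+):(?P<client_port>\d+)\s+
    \[(?P<accept_date>[^\]]+)\]\s+
    (?P<frontend_name>\S+)\s+
    (?P<backend_name>\w+)/(?P<server_name>\w+)\s+
    (?P<Tq>\d+)/(?P<Tw>\d+)/(?P<Tc>\d+)/(?P<Tr>\d+)/(?P<Tt>\d*)\s+
    (?P<status_code>\d+)\s+
    (?P<bytes_read>\d+)\s+
    (?P<captured_request_cookie>\S+)\s+
    (?P<captured_response_cookie>\S+)\s+
    (?P<termination_state>\S+)\s+
    (?P<actconn>\d+)/(?P<feconn>\d+)/(?P<beconn>\d+)/(?P<srv_conn>\d+)/(?P<retries>\d+)\s+
    (?P<srv_queue>\d+)/(?P<backend_queue>\d+)\s+
    \{(?P<captured_request_headers>[^}]*)\}\s+
    \{(?P<captured_response_headers>[^}]*)\}\s+
    "(?P<http_request>[^"]+)"$
''', re.VERBOSE)

SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
          '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"')

m = LOG_LINE.match(SAMPLE)
fields = m.groupdict() if m else {}
print(fields["backend_name"], fields["Tt"], fields["http_request"])
```

With named groups, adding or dropping a field no longer shifts the numbering of every group after it.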

score 1 · Accepted Answer

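Why do you want to match that line exactly? If you are looking for specific fields in it, it is better to say which ones and extract just those. If you want to run statistics on haproxy logs, you should take a look at the "halog" tool in the "contrib" directory of the sources. Take the one from version 1.4.9; it even knows how to sort URLs by response time.

But whatever you want to do with those lines, a regex will probably always be the slowest and most complex solution.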
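
When only a couple of fields are needed, plain whitespace splitting may be enough. The following is a rough Python sketch, not anything taken from halog. It assumes the syslog prefix has the same shape as the sample in the question (three date tokens, a host, then the process tag) and that no field before the captured headers contains a space. The function name status_and_total_time is hypothetical.

```python
def status_and_total_time(line):
    """Return (status_code, Tt in milliseconds) from one HAProxy HTTP log line."""
    # The first 17 fields contain no spaces (assumption), so a bounded split is
    # enough; the captured headers and the quoted request stay in the last chunk.
    parts = line.split(None, 17)
    if len(parts) < 18:
        raise ValueError("unexpected log line: %r" % line)
    timers = parts[9].split("/")   # Tq/Tw/Tc/Tr/Tt
    if len(timers) != 5:
        raise ValueError("unexpected timer field: %r" % parts[9])
    status = int(parts[10])
    total_ms = int(timers[4])      # int() also accepts a leading '+' or '-'
    return status, total_ms

SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
          '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"')

print(status_and_total_time(SAMPLE))  # (200, 109)
```

Because split() with no separator collapses runs of whitespace, this also tolerates a syslog date padded with two spaces.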

score 1 · Accepted Answer

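This looks like a very complex string to match. I would recommend using a tool like Expresso. Start with the string you are trying to match, then begin replacing parts of it with regex notation.

To grab the individual pieces, use grouping parentheses.

The other option is to make one regex for each piece you are trying to grab.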
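
The one-regex-per-piece idea might look like the Python sketch below. The choice of fields, the FIELD_PATTERNS table and the extract helper are all hypothetical, and each pattern leans on the single-space layout of the sample line in the question.

```python
import re

# One small pattern per field; each can be tested and changed on its own.
FIELD_PATTERNS = {
    "pid": re.compile(r'\w+\[(\d+)\]:'),
    "client": re.compile(r'\]: (\S+):(\d+) '),
    "accept_date": re.compile(r' \[([^\]]+)\] '),
    # The status code is the number right after the five slash-separated timers.
    "status_code": re.compile(r' (?:\d+/){4}\d+ (\d+) '),
    "http_request": re.compile(r'"([^"]+)"\s*$'),
}

def extract(line, field):
    """Return the captured groups for one field as a tuple, or None if absent."""
    m = FIELD_PATTERNS[field].search(line)
    return m.groups() if m else None

SAMPLE = ('Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
          '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"')

for name in FIELD_PATTERNS:
    print(name, extract(SAMPLE, name))
```

The trade-off is that each field costs a separate scan of the line, in exchange for patterns short enough to read at a glance.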

score 0 · Accepted Answer

I don't think regex is your best option... however, if it is your only option...

Try looking at these options: https://serverfault.com/q/62687/438
