python - Parsing a web log file with regular expressions in Python

Question

I have a web log file containing numeric host data and alphanumeric username data. Here are a few lines from the log file:

189.254.43.43 - swift6867 [21/Jun/2019:15:53:00 -0700] "GET /architectures/recontextualize/morph/scale HTTP/1.0" 204 8976
20.80.28.12 - hagenes4423 [21/Jun/2019:15:53:01 -0700] "POST /harness HTTP/1.1" 404 28127
112.211.50.38 - - [21/Jun/2019:15:53:03 -0700] "DELETE /harness/e-business/functionalities HTTP/1.1" 405 7975

Sometimes the username is replaced by a hyphen.

I only want to extract the data before the first square bracket and convert it into a list of dictionaries. For example:

example_dict = {"host":"189.254.43.43", 
                "user_name":"swift6867"}

This is the regular expression I am using:

pattern = """
    (?P<host>[\d]*[.][\d]*[.][\d]*[.][\d]*)     # host dictionary
    (?P<username>([\w]+|-)(?=\ \[))             # username dictionary 
"""

re.finditer(pattern,logdata,re.VERBOSE)

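The combined regular expression finds no matches. Only the individual parts work: if I comment out the username part of the pattern, the host part matches, and vice versa.

What am I doing wrong?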

score 0 · Accepted Answer

You can use the following regular expression:

^(?P<host>(?:\d+\.?){4})\s*-\s*(?P<user_name>[^\s-]*?)\s
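For comparison with the attempt in the question: that pattern appears to fail because, under re.VERBOSE, the whitespace between its two groups is ignored, so nothing in it matches the " - " that separates the host from the user name in the log. A minimal sketch of the same verbose pattern with the separator added follows. The names fixed_pattern and records are illustrative and do not come from the question, and the group is renamed user_name here to match the desired dictionary key.

```python
import re

# Sample lines copied from the question.
logdata = '''189.254.43.43 - swift6867 [21/Jun/2019:15:53:00 -0700] "GET /architectures/recontextualize/morph/scale HTTP/1.0" 204 8976
20.80.28.12 - hagenes4423 [21/Jun/2019:15:53:01 -0700] "POST /harness HTTP/1.1" 404 28127
112.211.50.38 - - [21/Jun/2019:15:53:03 -0700] "DELETE /harness/e-business/functionalities HTTP/1.1" 405 7975'''

# Hypothetical minimal fix of the question's pattern: consume the separator.
fixed_pattern = r"""
    (?P<host>\d+\.\d+\.\d+\.\d+)   # host: four dot-separated runs of digits
    \s-\s                          # the separator the original pattern skipped
    (?P<user_name>\w+|-)           # user name, or a lone hyphen when absent
    (?=\s\[)                       # must be followed by a space and a bracket
"""

records = [m.groupdict() for m in re.finditer(fixed_pattern, logdata, re.VERBOSE)]
for record in records:
    print(record)
```

Unlike the one-line pattern above, this variant keeps the hyphen as the user_name value when no user was logged.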

To create a list of dicts, you can call groupdict() on each Match object returned by finditer():

import re
...
pattern = r'^(?P<host>(?:\d+\.?){4})\s*-\s*(?P<user_name>[^\s-]*?)\s'
result = [i.groupdict() for i in re.finditer(pattern, logdata, re.MULTILINE)]
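A self-contained version of the snippet above, run against the three sample lines from the question, might look like the sketch below. The logdata string is filled in here for illustration; in practice it would presumably hold the contents of the log file.

```python
import re

# Sample lines copied from the question (assumed stand-in for the real log).
logdata = '''189.254.43.43 - swift6867 [21/Jun/2019:15:53:00 -0700] "GET /architectures/recontextualize/morph/scale HTTP/1.0" 204 8976
20.80.28.12 - hagenes4423 [21/Jun/2019:15:53:01 -0700] "POST /harness HTTP/1.1" 404 28127
112.211.50.38 - - [21/Jun/2019:15:53:03 -0700] "DELETE /harness/e-business/functionalities HTTP/1.1" 405 7975'''

pattern = r'^(?P<host>(?:\d+\.?){4})\s*-\s*(?P<user_name>[^\s-]*?)\s'

# re.MULTILINE lets ^ anchor at the start of every line, not only the first.
result = [m.groupdict() for m in re.finditer(pattern, logdata, re.MULTILINE)]

for row in result:
    print(row)
# {'host': '189.254.43.43', 'user_name': 'swift6867'}
# {'host': '20.80.28.12', 'user_name': 'hagenes4423'}
# {'host': '112.211.50.38', 'user_name': ''}
```

Note that with this pattern a hyphen in the user field yields an empty string for user_name, because the character class [^\s-] excludes the hyphen.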

The following regular expression takes slightly fewer steps, so it should be slightly faster on larger data:

^(?P<host>\d+\.\d+\.\d+\.\d+)\s*-\s*(?P<user_name>[^\s-]*?)\s
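One way to check that the two patterns agree, and to get a rough feel for their relative speed, is sketched below. The helper parse and the variable names are illustrative assumptions. The timings depend on the machine and the data, so no numbers are claimed here.

```python
import re
import timeit

# Sample lines copied from the question.
logdata = '''189.254.43.43 - swift6867 [21/Jun/2019:15:53:00 -0700] "GET /architectures/recontextualize/morph/scale HTTP/1.0" 204 8976
20.80.28.12 - hagenes4423 [21/Jun/2019:15:53:01 -0700] "POST /harness HTTP/1.1" 404 28127
112.211.50.38 - - [21/Jun/2019:15:53:03 -0700] "DELETE /harness/e-business/functionalities HTTP/1.1" 405 7975'''

pattern_a = re.compile(r'^(?P<host>(?:\d+\.?){4})\s*-\s*(?P<user_name>[^\s-]*?)\s', re.MULTILINE)
pattern_b = re.compile(r'^(?P<host>\d+\.\d+\.\d+\.\d+)\s*-\s*(?P<user_name>[^\s-]*?)\s', re.MULTILINE)


def parse(compiled, text):
    """Return one dict of named groups per match (hypothetical helper)."""
    return [m.groupdict() for m in compiled.finditer(text)]


# Both patterns should extract the same records from the sample lines.
same = parse(pattern_a, logdata) == parse(pattern_b, logdata)
print("same results:", same)

# Rough timing; absolute values vary between runs and machines.
time_a = timeit.timeit(lambda: parse(pattern_a, logdata), number=10_000)
time_b = timeit.timeit(lambda: parse(pattern_b, logdata), number=10_000)
print(f"pattern_a: {time_a:.4f}s  pattern_b: {time_b:.4f}s")
```

The second pattern is also stricter about the host: it requires a literal dot between each of the four groups of digits, whereas the first makes each dot optional.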
