python - Multi-line matching in Python

Question

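I've read all the articles I could find, and even understood a few of them, but as a Python newbie I'm still a bit lost and hoping for help :)

I'm working on a script to parse items of interest out of an application-specific log file. Each line begins with a timestamp that I can match, and I can define two things to identify what I want to capture: some partial content, and a string that terminates what I want to extract.

My issue is multi-line entries. In most cases every log line is terminated with a newline, but some entries contain SQL that may itself contain newlines, which creates extra lines in the log.

So, in a simple case I may have this: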

[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)

This all appears on one line, which I can match with this:

re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')

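However, in some cases there may be line breaks in the SQL, and I still want to capture the whole entry (and possibly replace the line breaks with spaces). I am currently reading the file one line at a time, which obviously isn't going to work, so...

Do I need to process the whole file in one go? The files are typically 20 MB in size. How do I read the entire file and iterate through it looking for single-line or multi-line blocks?
How would I write a multi-line regex that matches the whole entry whether it sits on one line or is spread across several lines?

My overall goal is to parameterize this so that I can use it to extract log entries matching different patterns for the starting string (always at the start of a line), the ending string (the point I want to capture up to), and a value between them that serves as an identifier.

Thanks in advance for any help!

Chris.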

import re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
print("--- START ----")
# Timestamp prefix such as "[8/21/13 " at the start of a log entry
lineStartsWith = re.compile(r'\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line):
                print('Full Line Found')
                print(line)
                print("- Record Separator -")
            else:
                print('Partial Line Found')
                print(line)
                print("- Record Separator -")

print("--- DONE ----")

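As a next step, for my partial lines, I will keep reading until I find lineEndsWith and assemble those lines into one block.

I'm no expert, so suggestions are always welcome!

UPDATE - I have it working now, thanks to all the responses that helped point me in the right direction. I realize it isn't pretty, and I need to clean up my if/elif mess and make it more efficient, but it is working! Thanks for all the help.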

import re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"

print("--- START ----")

lineStartsWith = re.compile(r'\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

lines = []

multiLine = False

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            # Complete entry on a single line
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            # Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            # Middle of a multi-line entry
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            # Last line of a multi-line entry: flatten it and store it
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False

for line in lines:
    print(line)

score 3 · Accepted Answer

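Do I need to process the whole file in one go? The files are typically 20 MB in size. How do I read the entire file and iterate through it looking for single-line or multi-line blocks?

You have two options here.

You could read the file block by block, making sure to prepend any "leftover" bit from the end of each block to the start of the next one, and search each block. Of course you'll have to work out what counts as "leftover" by looking at your data format and at what your regex can match, and in theory several blocks in a row could all count as leftover...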
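As a rough illustration of the block-by-block option, here is a minimal Python 3 sketch. The helper name iter_entries, the chunk size, the simplified pattern, and the in-memory sample are all assumptions made for this example and do not come from the question. It treats everything after the last complete match as the leftover, which is safe for a lazy pattern ending in a fixed literal like this one but is not guaranteed for arbitrary regexes.

```python
import io
import re


def iter_entries(fileobj, pattern, chunk_size=64 * 1024):
    """Yield every match of `pattern`, reading `fileobj` in fixed-size chunks."""
    leftover = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer = leftover + chunk
        end_of_last_match = 0
        for match in pattern.finditer(buffer):
            yield match.group(0)
            end_of_last_match = match.end()
        # Whatever follows the last complete match may be the start of an
        # entry whose terminator has not been read yet, so carry it forward.
        leftover = buffer[end_of_last_match:]


# Simplified, hypothetical pattern: a timestamp prefix up to "milliseconds)".
entry_re = re.compile(r"\[\d+/\d+/\d+ .*?milliseconds\)", re.DOTALL)

# In-memory stand-in for the log file; a real script would pass an open file.
sample = io.StringIO(
    "[8/21/13 11:30:33:557 PDT] select 1\nfrom item (execution took 5 milliseconds)\n"
    "[8/21/13 11:30:35:123 PDT] select 2 (execution took 7 milliseconds)\n"
)

# A tiny chunk size forces entries to straddle several chunks.
entries = list(iter_entries(sample, entry_re, chunk_size=16))
for entry in entries:
    print(repr(entry))
```

The leftover keeps growing until a terminator shows up, so memory use depends on how far apart the matching entries are.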

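Or you could just mmap the file. An mmap acts like a bytes object (or like a str in Python 2.x) and leaves it to the OS to page blocks in and out as necessary. Unless you're trying to deal with absolutely huge files (gigabytes on 32-bit, even more on 64-bit), this is trivial and efficient: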

import mmap

with open('bigfile', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        # On Python 3, compiled_re must be a bytes pattern, e.g. re.compile(rb'...')
        for match in compiled_re.finditer(m):
            do_stuff(match)

In older versions of Python, mmap isn't a context manager, so you'll need to wrap it in contextlib.closing (or just use an explicit close if you prefer).
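For reference, the contextlib.closing variant might look like the sketch below. The function name find_timestamps, the simplified pattern, and the temporary demo file are assumptions for illustration only. Note the bytes pattern: on Python 3 an mmap exposes the file contents as bytes.

```python
import contextlib
import mmap
import os
import re
import tempfile

# Bytes pattern, because the mmap object behaves like bytes on Python 3.
entry_re = re.compile(rb"\[(.*?)\].*?milliseconds\)", re.DOTALL)


def find_timestamps(path):
    """Return the bracketed timestamp of every entry in the file at `path`."""
    with open(path, "rb") as f:
        # closing() calls m.close() on exit, so this also works where mmap
        # objects are not context managers themselves.
        with contextlib.closing(
            mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
        ) as m:
            return [match.group(1).decode("ascii") for match in entry_re.finditer(m)]


# Demo on a small temporary file standing in for the real log.
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"[8/21/13 11:30:33:557 PDT] select 1\nfrom item (took 5 milliseconds)\n")
try:
    timestamps = find_timestamps(tmp.name)
finally:
    os.remove(tmp.name)

print(timestamps)
```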

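How would I write a multi-line regex that matches the whole entry whether it sits on one line or is spread across several lines?

You could use the DOTALL flag, which makes . match newlines. You could instead use the MULTILINE flag and put in appropriate $ and/or ^ characters, but that makes simple cases a lot harder, and it's rarely necessary. Here's an example with DOTALL (using a simpler regexp to make it more obvious):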

>>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)"""
>>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and 
    (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)"""
>>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
>>> r.findall(s1)
['8/21/13 11:30:33:557 PDT']
>>> r.findall(s2)
['8/21/13 11:30:33:557 PDT']

As you can see, the second .*? matched the newline just as easily as a space.
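One way to approach the parameterization goal from the question is to combine DOTALL with MULTILINE, so that ^ pins each entry to the start of a line. The sketch below is only an assumption about how that could be done: build_entry_re and its three arguments are hypothetical names, and the sample log is made up. The negative lookahead stops an entry from running into the next timestamped line, so an entry without the identifier is skipped rather than merged with its neighbour.

```python
import re


def build_entry_re(start, marker, end):
    """Compile a regex for entries that begin a line with `start`,
    contain `marker`, and finish at `end`, possibly across several lines."""
    start_group = "(?:" + start + ")"
    # Any character (newlines included) that is not where a new entry begins.
    body = "(?:(?!^" + start_group + ").)*?"
    return re.compile(
        "^" + start_group + body + "(?:" + marker + ")" + body + "(?:" + end + ")",
        re.MULTILINE | re.DOTALL,
    )


entry_re = build_entry_re(
    start=r"\[\d{1,2}/\d{1,2}/\d{2} ",  # timestamp prefix such as "[8/21/13 "
    marker=r"BMXAA6720W",               # identifier taken from the question
    end=r"milliseconds\)",              # terminator taken from the question
)

log = (
    "[8/21/13 11:30:33:557 PDT] 00000488 BMXAA6720W select *\n"
    "  from item (execution took 2083 milliseconds)\n"
    "[8/21/13 11:30:34:002 PDT] 00000489 unrelated message\n"
    "[8/21/13 11:30:35:123 PDT] 0000048A BMXAA6720W select 2 (execution took 7 milliseconds)\n"
)

matches = [m.group(0) for m in entry_re.finditer(log)]
for entry in matches:
    print(repr(entry))
```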

If you just want to treat newlines as whitespace, you don't need either flag; '\s' already matches newlines.

For example:

>>> s1 = 'abc def\nghi\n'
>>> s2 = 'abc\ndef\nghi\n'
>>> r = re.compile(r'abc\s+def')
>>> r.findall(s1)
['abc def']
>>> r.findall(s2)
['abc\ndef']
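The same property of \s can be used to flatten a captured entry onto one line, which the question mentions as a possible follow-up step. This is a small sketch with a made-up sample string. Be aware that it also collapses runs of spaces inside SQL string literals, which may or may not be acceptable.

```python
import re

# Made-up fragment of a multi-line SQL entry.
entry = "select * from item where status != 'OBSOLETE' and \n    itemsetid = 'ITEMSET1'"

# Replace every run of whitespace (spaces, tabs, newlines) with one space.
flattened = re.sub(r"\s+", " ", entry)
print(flattened)
```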

score 0 · Accepted Answer

You can read the whole file into a single string and then use re.split to get a list of all the entries, split at the timestamps. Here's an example:

import re

with open(...) as f:  # path to the log file, elided in the original
    allLines = f.read()
entries = re.split(regex, allLines)  # regex: a pattern that matches the timestamp
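To make the re.split idea concrete, here is one possible sketch. The pattern and the sample text are assumptions, not something from the answer. It splits on a zero-width lookahead so that each timestamp stays attached to its entry. Splitting on an empty match requires Python 3.7 or later.

```python
import re

# Made-up log text standing in for the contents of the real file.
log_text = (
    "[8/21/13 11:30:33:557 PDT] select *\n"
    "  from item (execution took 2083 milliseconds)\n"
    "[8/21/13 11:30:34:002 PDT] select 1 (execution took 12 milliseconds)\n"
)

# Zero-width match at the start of every line that begins with a timestamp.
timestamp_start = re.compile(r"^(?=\[\d{1,2}/\d{1,2}/\d{2} )", re.MULTILINE)

# The split at position 0 produces an empty first element, so drop empties.
entries = [entry for entry in timestamp_start.split(log_text) if entry]
for entry in entries:
    print(repr(entry))
```

Each element of entries is then a complete log entry, including any continuation lines, and can be filtered further by identifier or terminator.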
