python - Python - Reading a file with single-line/multi-line entries

Question

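I'm very new to Python, and I've found answers to most of my questions here, but this one has me stumped.

I'm working with log files in Python, and normally every line starts with a date/time stamp, for example: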

[1/4/13 18:37:37:848 PST]

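99% of the time I can read line by line, look for items of interest, and process them accordingly, but occasionally an entry in the log file contains a message with embedded carriage return/line feed characters, so it spans multiple lines.

Is there an easy way to read the file "between timestamps", so that when this happens the multiple lines are combined into a single read? For example: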

[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow
got some new line
characters mixed in
[1/4/13 18:37:37:848 PST] The last log entry

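would be read as four lines instead of the six it is now.

Thanks in advance for your help.

Chris

Update:

myTestFile.log contains the exact text above, and this is my script: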

import sys, getopt, os, re
sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/myTestFile.log"
lines = []

def timestamp_split(file):
    pattern = re.compile("\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )")
    current = []
    for line in file:
        if not re.match(pattern,line):
            if current:
                yield "".join(current)
            current == [line]
        else:
            current.append(line)
    yield "".join(current)

print "--- START ----"
with open(logFileName) as file:
    for entry in timestamp_split(file):
        print entry
        print "- Record Separator -"
print "--- DONE ----"

When I run it, I get this:

--- START ----
[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow

- Record Separator -
[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow

- Record Separator -
[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow
[1/4/13 18:37:37:848 PST] The last log entry
- Record Separator -
--- DONE ----

I seem to be iterating over the lines too many times. What I was expecting (hoping for) is:

--- START ----
[1/4/13 18:37:37:848 PST] A log entry
- Record Separator -
[1/4/13 18:37:37:848 PST] Another log entry
- Record Separator -
[1/4/13 18:37:37:848 PST] A log entry that somehow got some new line characters mixed in
- Record Separator -
[1/4/13 18:37:37:848 PST] The last log entry
- Record Separator -
--- DONE ----

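As discussed in the comments, I had accidentally left the not in the comparison against the regex pattern while testing. If I remove it, I get only the partial lines, which confuses me even more!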

--- START ----
got some new line
characters mixed in

- Record Separator -
got some new line
characters mixed in

- Record Separator -
--- DONE ----

score 3 · Accepted Answer

The simplest approach is to implement a small generator that does this:

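def timestamp_split(file):
    current = []
    for line in file:
        if line.startswith("["):
            # A new entry starts here: emit the previous one, if any.
            if current:
                yield "".join(current)
            current = [line]
        else:
            # Continuation line: part of the current entry.
            current.append(line)
    # Emit the last entry (skipped for an empty file).
    if current:
        yield "".join(current)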

Naturally, this assumes that a "[" at the start of a line is enough to indicate a timestamp; you may want a more robust check.
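
A stricter check might look like the following sketch. The TIMESTAMP pattern is an assumption based on the sample format [1/4/13 18:37:37:848 PST] from the question, and the name timestamp_split_strict is made up for illustration; both would need adjusting to the real log format. It is written for Python 3.

```python
import re

# Assumed format, taken from the sample in the question: [1/4/13 18:37:37:848 PST]
TIMESTAMP = re.compile(r"\[\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2}:\d{2}:\d{3} [A-Z]+\]")


def timestamp_split_strict(lines):
    """Yield one string per log entry; an entry starts at a timestamped line."""
    current = []
    for line in lines:
        if TIMESTAMP.match(line):
            # A new entry begins: emit the one collected so far, if any.
            if current:
                yield "".join(current)
            current = [line]
        else:
            # Continuation line: it belongs to the entry being collected.
            current.append(line)
    if current:
        yield "".join(current)
```

With this version, a continuation line that merely happens to begin with "[" stays attached to the entry it belongs to instead of starting a new one.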

Then simply do the following:

with open("somefile.txt") as file:
    for entry in timestamp_split(file):
        ...

(Note the use of the with statement here, which is good practice for opening files.)
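
To try the generator on the sample from the question without touching the disk, the sketch below feeds it an in-memory file via io.StringIO. A copy of the generator is included so the block runs on its own (Python 3). The " ".join(entry.split()) step is an addition, not part of the generator: it collapses the embedded newlines so each entry prints on one line, as in the output the question hopes for.

```python
import io

SAMPLE = """\
[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow
got some new line
characters mixed in
[1/4/13 18:37:37:848 PST] The last log entry
"""


def timestamp_split(file):
    current = []
    for line in file:
        if line.startswith("["):
            if current:
                yield "".join(current)
            current = [line]
        else:
            current.append(line)
    if current:
        yield "".join(current)


# io.StringIO stands in for the opened log file.
# split() + join collapses newlines and repeated spaces into single spaces.
entries = [" ".join(entry.split()) for entry in timestamp_split(io.StringIO(SAMPLE))]

for entry in entries:
    print(entry)
    print("- Record Separator -")
```

This yields four entries for the six input lines, with the three-line message merged into one.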

score 0 · Accepted Answer

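import re

lines = []
# Raw string avoids invalid-escape warnings. The sample timestamps carry a
# milliseconds field (18:37:37:848), hence four colon-separated numbers.
pattern = re.compile(r'\[\d+/\d+/\d+\s\d+:\d+:\d+:\d+\s\w+\]')
with open('filename.txt', 'r') as f:
    for line in f:
        if pattern.match(line):
            # Timestamped line: start a new entry.
            lines.append(line)
        else:
            # Continuation line: append to the previous entry.
            lines[-1] += line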

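This matches the timestamp with a regular expression, which can be adjusted as needed. It also assumes that the first line contains a timestamp.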
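
If the first line might lack a timestamp (a header line, say, or a file that starts mid-entry), lines[-1] raises IndexError on the still-empty list. One way to tolerate that is sketched below for Python 3: start a new entry whenever nothing has been collected yet. The helper name read_entries is hypothetical, and the pattern is assumed from the sample in the question.

```python
import re

# Assumed timestamp format, from the question's sample: [1/4/13 18:37:37:848 PST]
pattern = re.compile(r"\[\d+/\d+/\d+\s\d+:\d+:\d+:\d+\s\w+\]")


def read_entries(lines):
    """Return a list of entries; continuation lines are appended to the previous entry."""
    entries = []
    for line in lines:
        if pattern.match(line) or not entries:
            # Timestamped line, or leading text before any timestamp: new entry.
            entries.append(line)
        else:
            # Continuation line: glue it onto the most recent entry.
            entries[-1] += line
    return entries
```

It accepts any iterable of lines, so an open file object can be passed in directly.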
