I am currently trying to parse Mailman txt-archive files. These files concatenate all emails sent through the list into a single file. The structure looks like this:
From SOMETHING
From: SOMETHING
Date: SOMETHING
Subject: SOMETHING
In-Reply-To: SOMETHING
Message-ID: <SOMETHING>
CONTENT
From SOMETHING
From: SOMETHING
Date: SOMETHING
Subject: SOMETHING
In-Reply-To: SOMETHING
Message-ID: SOMETHING
CONTENT
[...]
The problem is that CONTENT can contain newlines, so I can't simply split the archive into messages and then parse each message.
My attempt at parsing this is:
def parseContent(content):
    import re
    # One group per header line; the last group is meant to capture the body.
    pattern = (r"From (.*)\n"
               r"From: (.*)\n"
               r"Date: (.*)\n"
               r"Subject: (.*)\n"
               r"In-Reply-To: (.*)\n"
               r"Message-ID: (.*)\n"
               r"(.*)")
    matches = re.findall(pattern, content)
    for from1, from2, date, subject, inreply, messageid, body in matches:
        print(from1)
        print(body)
        print("#" * 20)
    return matches
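But body does not contain the body of the message, only a newline. How can I make the last group match everything, yet stop as soon as the header block above matches again?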
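One direction that might work, as a minimal sketch rather than a definitive answer: by default `.` does not match a newline, so the final `(.*)` stops at the end of its line. With `re.DOTALL` the body group can span lines, and making it lazy with a lookahead lets it stop just before the next message. The names `MESSAGE_RE` and `parse_archive` are made up for illustration. The sketch assumes every message carries all six header lines in exactly the order shown above, and that no body contains a line starting with "From " that is directly followed by a "From: " line.

```python
import re

# Header groups use [^\n]* so they stay on their own line even with DOTALL;
# only the body group is allowed to span several lines.
MESSAGE_RE = re.compile(
    r"^From ([^\n]*)\n"
    r"From: ([^\n]*)\n"
    r"Date: ([^\n]*)\n"
    r"Subject: ([^\n]*)\n"
    r"In-Reply-To: ([^\n]*)\n"
    r"Message-ID: ([^\n]*)\n"
    r"(.*?)"                          # lazy body, may contain newlines
    r"(?=^From [^\n]*\nFrom: |\Z)",   # stop before the next message or at the end
    re.DOTALL | re.MULTILINE,
)


def parse_archive(content):
    """Return a list of 7-tuples: the six header values followed by the body."""
    return MESSAGE_RE.findall(content)
```

Calling `parse_archive(content)` on the whole archive text should then give one tuple per message, with the body as the last element. Messages that lack one of the headers (for example a thread starter without In-Reply-To) would not match this pattern as written.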