python - Splitting a string with re.split()

Question

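I have many emails in a single string, and I need to split that string into individual emails. Each email starts with "From:" on a new line. If "From:" does not appear anywhere else in the body, the following works: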

list_of_email_strings = re.split("From:", my_email_text_string)

I need to ignore any "From:" that does not come immediately after a newline. The following (with a caret) does not work:

list_of_email_strings = re.split("^From:", my_email_text_string)

Any solutions?

score 1 · Accepted Answer

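You can combine \n with a non-consuming lookahead assertion (?=...). Its advantage is that it does not eat the text you are splitting on, so "From:" stays intact in each piece.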

list_of_email_strings = re.split("\n(?=From:)", my_email_text_string)

For example:

>>> s = "From: ...\nFrom: ...\nFrom: ..."
>>> re.split("\n(?=From:)", s)
['From: ...', 'From: ...', 'From: ...']

Compared with:

>>> re.split("\nFrom:", s)
['From: ...', ' ...', ' ...']
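
As a side note on why the caret attempt in the question fails: without flags, ^ matches only at the very start of the string, and re.MULTILINE is needed for it to match after each newline. The sketch below compares the three variants on a made-up sample string (the sample text and variable names are assumptions for illustration, not from the question).

```python
import re

# Hypothetical sample text; one "From:" also appears mid-line in a body.
sample = "From: a\nbody mentions From: x\nFrom: b\nbody"

# Without flags, ^ matches only at the very start of the string.
no_flag = re.split(r"^From:", sample)

# With re.MULTILINE, ^ also matches right after each newline, but the
# matched "From:" text is consumed by the split.
multiline = re.split(r"^From:", sample, flags=re.MULTILINE)

# The lookahead consumes only the newline, so "From:" stays in each piece.
lookahead = re.split(r"\n(?=From:)", sample)

print(no_flag)
print(multiline)
print(lookahead)
```

In all three cases the "From:" in the middle of the second line is left alone; only the lookahead version keeps the "From:" prefix on each email.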

score 1 · Accepted Answer

Similar to wim's answer, but with From: added back to each email as needed:

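# [1:] drops the piece before the first "From:" (empty when the text starts with "From:")
list_of_email_strings = ['From:' + msg for msg in ('\n' + text).split('\nFrom:')[1:]]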

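However, there are some native Python modules that give you better, more reliable control over reading the kind of email file you describe. email and mailbox come to mind.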
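
As a rough illustration of what the email module adds once the string has been split, the sketch below parses each piece with email.message_from_string and reads its headers and body. The sample text and variable names are assumptions made up for this example.

```python
import email
import re

# Hypothetical sample: two minimal emails concatenated in one string.
raw = (
    "From: alice@example.com\nSubject: hi\n\nHello\n"
    "From: bob@example.com\nSubject: re\n\nBye"
)

# Split on newlines that are immediately followed by "From:".
pieces = re.split(r"\n(?=From:)", raw)

# Parse each piece into a message object with header access.
parsed = [email.message_from_string(piece) for piece in pieces]

senders = [message["From"] for message in parsed]
subjects = [message["Subject"] for message in parsed]
bodies = [message.get_payload() for message in parsed]

print(senders)
print(subjects)
```

Header lookup by name is the main gain here: it avoids hand-written string slicing on each piece.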

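Assuming these are standard mbox-style emails, like those used by sendmail or Postfix, where each message begins with a "From" line followed by some header lines, maybe a digest, and so on, you can do the following if you first write the string to a file or just use the existing file. (Strictly speaking, the mbox separator is a line starting with "From" plus a space, not the "From:" header, so this only applies if the data really is in mbox format.)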

import mailbox

mbox = mailbox.mbox(path_to_mailbox_file)
mbox.lock()  # only if you're using an active mailbox file
message_strings = [message.as_string() for message in mbox]
mbox.unlock()  # again, only if you're using an active mailbox file
mbox.close()

To get the number of messages, just use len(mbox).

There are many other useful features. I have written a few scripts with these modules and was very happy with the results. (Note that as_string may reformat some headers.)
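
To make the "write the string to a file first" step concrete, here is a minimal sketch that writes mbox-formatted text to a temporary file and reads it back with mailbox.mbox. The sample data, the read_mbox_text helper name and the file name are all assumptions for illustration; the sample uses "From" plus a space as the separator line, which is what the mbox format expects.

```python
import mailbox
import os
import tempfile

# Hypothetical sample data in mbox format: each message starts with a
# "From " separator line (no colon), followed by ordinary headers.
SAMPLE_MBOX = (
    "From alice@example.com Mon Jan  1 00:00:00 2024\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "Body of the first message.\n"
    "\n"
    "From bob@example.com Mon Jan  1 00:05:00 2024\n"
    "From: bob@example.com\n"
    "Subject: second\n"
    "\n"
    "Body of the second message.\n"
)


def read_mbox_text(mbox_text):
    """Write mbox text to a temporary file and return each message as a string."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.mbox")
        with open(path, "w", newline="\n") as handle:
            handle.write(mbox_text)
        # create=False: fail instead of silently creating an empty mailbox.
        mbox = mailbox.mbox(path, create=False)
        try:
            return [message.as_string() for message in mbox]
        finally:
            mbox.close()


messages = read_mbox_text(SAMPLE_MBOX)
print(len(messages))
```

Locking is left out of this sketch because the temporary file is not an active mailbox.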
