I want to read a large 3 GB .mbox file from a Gmail backup. This works:

import mailbox

mbox = mailbox.mbox(r"D:\All mail Including Spam and Trash.mbox")
for i, message in enumerate(mbox):
    print("from   :", message['from'])
    print("subject:", message['subject'])
    if message.is_multipart():
        # get_payload(decode=True) returns bytes, or None for nested
        # multipart parts, so join as bytes and substitute b'' for None.
        content = b''.join(part.get_payload(decode=True) or b''
                           for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:", content)
    print("**************************************")

    if i == 9:  # stop after the first 10 messages
        break

The problem is that the first 10 messages take more than 40 seconds.

Is there a faster way to access a large .mbox file with Python?

1 Answer

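Here's a quick and dirty attempt at a generator that reads an mbox file message by message. I have opted to simply discard the information in the From separator line; I'm guessing the real mailbox library might provide more information. And of course, this only supports reading, not searching or writing back to the input file.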

#!/usr/bin/env python3

import email
from email.policy import default

class MboxReader:
    def __init__(self, filename):
        self.handle = open(filename, 'rb')
        # An mbox file must begin with a "From " separator line. Check it
        # explicitly: an assert would be skipped entirely under python -O.
        if not self.handle.readline().startswith(b'From '):
            self.handle.close()
            raise ValueError('not an mbox file: %s' % filename)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.handle.close()

    def __iter__(self):
        # Generator: collect lines until the next "From " separator (or
        # end of file), then parse and yield the collected message.
        lines = []
        while True:
            line = self.handle.readline()
            if line == b'' or line.startswith(b'From '):
                yield email.message_from_bytes(b''.join(lines), policy=default)
                if line == b'':
                    break
                lines = []
                continue
            lines.append(line)

Usage:

with MboxReader(mboxfilename) as mbox:
    for message in mbox:
        print(message.as_string())
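
To mirror what the question prints (sender, subject, and content for the first 10 messages), the same streaming idea can be combined with itertools.islice so that reading stops as soon as enough messages have been parsed. The following is only a minimal sketch: the names iter_mbox_messages, summarize, and first_messages are made up for illustration, and it assumes that every line starting with "From " is a message separator (so body lines escaped as ">From " are left as they are). As far as I can tell, the delay in the question comes from mailbox.mbox scanning the whole file to build its index before it yields the first message, which a sequential reader avoids.

```python
import email
import itertools
from email.policy import default


def iter_mbox_messages(path):
    """Yield one EmailMessage per mbox entry, reading the file sequentially."""
    lines = []
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b"From "):
                # A separator line ends the previous message, if there was one.
                if lines:
                    yield email.message_from_bytes(b"".join(lines), policy=default)
                lines = []
            else:
                lines.append(line)
    # Emit the last message, which has no separator after it.
    if lines:
        yield email.message_from_bytes(b"".join(lines), policy=default)


def summarize(message):
    """Return (from, subject, text) for one message; text may be empty."""
    # Prefer a text/plain part and fall back to text/html.
    body = message.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    return message["from"], message["subject"], text


def first_messages(path, limit=10):
    """Collect summaries for at most `limit` messages without reading further."""
    return [summarize(m) for m in itertools.islice(iter_mbox_messages(path), limit)]
```

Calling first_messages(mboxfilename) returns a list of (from, subject, text) tuples. Because the generator is lazy, only the part of the file needed for those messages is read.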
answered 2020-01-10T13:17:34.253