python - Extracting messages from a thread without the HTML or quoted original messages

Question

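For any thread query such as "to:xyz@gmail.com", I can already return all matching thread IDs and feed them into the message get method to return a list of all messages in all threads.

However, many messages contain all of the previous messages, which creates a breadcrumb trail in each message and greatly inflates the size of every message returned. Other messages also contain HTML elements.

What is the best way to parse all of this so that only the sent and received messages are returned, without the breadcrumbs and the HTML excess?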

score 1 · Accepted Answer

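Based on the Gmail raw message format, I put together this very rough parser. It works by using the first Content-Type header to get the multipart boundary. It then splits the message on that boundary and takes the first part.

This leaves out all of the HTML, so only the text message and the breadcrumbs remain to be dealt with.

After that we can split the message line by line, drop the remaining content headers, collect the message text, and stop when we reach the first reply.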

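# `messages` is assumed to hold the decoded raw text of a single message.
# Read the multipart boundary from the first multipart/alternative header.
multipart_boundary = ''
for r in messages.split('\n'):
    if r.startswith('Content-Type: multipart/alternative; boundary='):
        # Strip the trailing '\r' and any quotes around the boundary value.
        multipart_boundary = r[r.find('boundary=') + 9:].strip().strip('"')
        break

# Keep only the first part, which sits between the first two boundary markers.
delimiter = '--' + multipart_boundary
messages = messages[messages.find(delimiter) + len(delimiter):]
messages = messages[:messages.find(delimiter)]

# Skip part headers and quoted lines; stop at the first reply attribution.
newmsg = ""
for line in messages.split('\n'):
    if line.startswith('Content-') or line.startswith('>'):
        continue
    elif line.startswith('On') and line.strip().endswith('wrote:'):
        break
    else:
        newmsg = newmsg + "\n" + line

print(newmsg)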
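
For comparison, here is a minimal sketch of the same idea using Python 3's standard `email` package instead of manual string splitting. It is not the approach in the answer above, just one possible alternative. The helper names `decode_raw` and `extract_new_text` are made up for illustration, and the sketch assumes the message was fetched with `format='raw'`, in which case the Gmail API returns the message base64url-encoded in the `raw` field.

```python
import base64
from email import message_from_bytes, policy


def decode_raw(raw_b64url):
    """Decode the base64url string found in the `raw` field (hypothetical helper)."""
    # Re-add any padding that may have been stripped before decoding.
    padded = raw_b64url + "=" * (-len(raw_b64url) % 4)
    return base64.urlsafe_b64decode(padded)


def extract_new_text(raw_bytes):
    """Return the plain-text body without the quoted reply chain (hypothetical helper)."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Ask for the text/plain alternative only, so the HTML part is never touched.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    kept = []
    for line in part.get_content().splitlines():
        stripped = line.strip()
        if stripped.startswith("On") and stripped.endswith("wrote:"):
            break  # first reply attribution: what follows is quoted history
        if line.startswith(">"):
            continue  # quoted line
        kept.append(line)
    return "\n".join(kept).strip()
```

Usage would look like `extract_new_text(decode_raw(message['raw']))`, where `message` stands for one message resource returned by the API. The "On ... wrote:" check is the same heuristic as in the answer, so it has the same limits: it depends on the mail client and language, and an attribution line that is wrapped across two lines will not match.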
