python-3.x - Extracting text/plain and HTML bodies from an MBOX file into lists

Question

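I am trying to extract the bodies of emails from an mbox file (previously converted from PST format).

I took the basic functions from another question (Extracting the body of an email from mbox file, decoding it to plain text regardless of Charset and Content Transfer Encoding). It works for extracting the 'text/plain' body content, but I also want to extract the HTML content.

The functions are called from the last part of the code to extract the bodies, and I tried to modify that part so that the text and HTML strings are stored in separate lists.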

import mailbox

def getcharsets(msg):
    charsets = set({})
    for c in msg.get_charsets():
        if c is not None:
            charsets.update([c])
    return charsets

def handleerror(errmsg, emailmsg, cs):
    print()
    print(errmsg)
    print("This error occurred while decoding with ",cs," charset.")
    print("These charsets were found in the one email.",getcharsets(emailmsg))
    print("This is the subject:",emailmsg['subject'])
    print("This is the sender:",emailmsg['From'])

def getbodyfromemail(msg):
    body = 'no_text'
    body_html = 'no_html'
    #Walk through the parts of the email to find the text body.    
    if msg.is_multipart():    
        for part in msg.walk():

            # If part is multipart, walk through the subparts.            
            if part.is_multipart(): 

                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        # Get the subpart payload (i.e the message body)
                        body = subpart.get_payload(decode=True) 
                        #charset = subpart.get_charset()
                    elif subpart.get_content_type() == 'html':
                        body_html = subpart.get_payload(decode=True)
                        #body_html = subpart.get_payload(decode=True)

            # Part isn't multipart so get the email body
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)
                #charset = part.get_charset()

    # If this isn't a multi-part message then get the payload (i.e the message body)
    elif msg.get_content_type() == 'text/plain':
        body = msg.get_payload(decode=True) 

   # No checking done to match the charset with the correct part. 
    for charset in getcharsets(msg):
        try:
            body = body.decode(charset)
        except UnicodeDecodeError:
            handleerror("UnicodeDecodeError: encountered.",msg,charset)
        except AttributeError:
             handleerror("AttributeError: encountered" ,msg,charset)
    return body, body_html

mboxfile = 'Bandeja de entrada'
body = []
body_html = []
for thisemail in mailbox.mbox(mboxfile):
    body = body.append(getbodyfromemail(thisemail)[0])
    body_html = body_html.append(getbodyfromemail(thisemail)[1])
    print(body_html)

But now it gives me an error: AttributeError: 'NoneType' object has no attribute 'append'

The output I expect:

body = [string, string, string]
body_html = [html, html, html]

score 0 · Accepted Answer

Your code works for me, except that you should replace the list appends with the following:

for thisemail in mailbox.mbox(mboxfile):
    body.append(getbodyfromemail(thisemail)[0])
    body_html.append(getbodyfromemail(thisemail)[1])
    print(body_html)

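Python's list.append works in place, so it returns None. Assigning that return value back to body replaces the list with None, and the next call to append then fails. If you do want an assignment, you can replace the append with list concatenation, for example: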

body = body + [getbodyfromemail(thisemail)[0]]
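
To make the failure concrete, here is a minimal sketch that reproduces it without an mbox file. The helper get_bodies is a hypothetical stand-in for getbodyfromemail, and the sample strings are made up for illustration. It also calls the helper once per message and unpacks the returned tuple, instead of calling it twice as in the loop above.

```python
# list.append mutates the list in place and returns None.
bodies = []
result = bodies.append("first body")
print(result)   # None
print(bodies)   # ['first body']

# Rebinding the name to that return value discards the list,
# so the next call fails in the same way as in the question.
broken = []
broken = broken.append("first body")   # broken is now None
try:
    broken.append("second body")
except AttributeError as exc:
    print(exc)


def get_bodies(msg):
    # Hypothetical stand-in for getbodyfromemail: returns (text, html).
    return "text of " + msg, "<p>html of " + msg + "</p>"


body = []
body_html = []
for thisemail in ["a", "b"]:
    text, html = get_bodies(thisemail)   # one call, tuple unpacked
    body.append(text)                    # no assignment needed
    body_html.append(html)

print(body)
print(body_html)
```

Unpacking the tuple also avoids decoding every message twice, which can matter for a large mailbox.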

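One more thing that may be worth checking: get_content_type() returns the full MIME type, so a comparison against 'html' does not match an HTML part, whose type is 'text/html'. Below is a rough sketch of one way to collect both bodies in a single walk, decoding each part with its own declared charset. The name extract_bodies, the 'utf-8' fallback, and the choice to skip attachments and keep only the first part of each type are assumptions for illustration, not part of the original code.

```python
from email.message import EmailMessage


def extract_bodies(msg):
    """Return (plain, html) as str, with the question's placeholder defaults."""
    plain, html = 'no_text', 'no_html'
    # walk() yields the message itself and every nested part, so one loop
    # covers both multipart and non-multipart messages.
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no body of their own
        if part.get_content_disposition() == 'attachment':
            continue  # assumption: attached text files are not the body
        ctype = part.get_content_type()
        if ctype not in ('text/plain', 'text/html'):
            continue
        payload = part.get_payload(decode=True)  # bytes, base64/QP undone
        if payload is None:
            continue
        # Use the charset declared on this part; the fallback is an assumption.
        charset = part.get_content_charset() or 'utf-8'
        try:
            text = payload.decode(charset, errors='replace')
        except LookupError:  # charset name unknown to Python
            text = payload.decode('utf-8', errors='replace')
        if ctype == 'text/plain' and plain == 'no_text':
            plain = text
        elif ctype == 'text/html' and html == 'no_html':
            html = text
    return plain, html


# Small self-contained check with a message built in memory.
demo = EmailMessage()
demo['Subject'] = 'demo'
demo.set_content('hello')
demo.add_alternative('<p>hello</p>', subtype='html')
demo_plain, demo_html = extract_bodies(demo)
print(repr(demo_plain), repr(demo_html))
```

With an mbox file, the same function could be called on each message yielded by mailbox.mbox(mboxfile), appending the two returned strings to body and body_html.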