
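This is a self-answered question, and I would value any input, comments, or code review. I think it works, but I'm not sure. My tests seem to produce valid results, but email is a subtle and complex beast, and I'm not sure my logic is sound. If you want to test it, save a raw email to a file and put the filename into the code; it should be obvious where. Is there a better way? If so, I'd love to hear it.

Python 2.7 code.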

import email

filename = 'xxx.eml'

with open(filename, 'rb') as f:
    msg = email.message_from_file(f)

    # count number of attachments in an email
    # this determines the 'real' attachments, ie those that a user might have attached to the email
    # it does not include the attachments that make up the message content
    totalattachments = 0
    firsttextattachmentseen = False
    lastseenboundary = ''
    # .walk steps through all the parts of an email including boundaries and attachments
    for part in msg.walk():
        if part.is_multipart():
            # this is a boundary, not an attachment, so we record it as the last seen boundary and continue to next part
            lastseenboundary = part.get_content_type()
            continue
        if lastseenboundary == 'multipart/alternative':
            #for HTML emails, the multipart/alternative part contains the HTML and its alternative 
            #text representation, so we skip anything within the multipart/alternative boundary
            continue
        if part.get_content_type() == 'text/plain':
            #if this is a plain text email, then the first txt attachment is the message body so we do not 
            #count it as an attachment
            if firsttextattachmentseen == False:
                firsttextattachmentseen = True
                continue
            else:
                totalattachments += 1
                continue
        # any other part we encounter we shall assume is a user added attachment
        totalattachments += 1

    # format explicitly: under Python 2.7, print(a, b, c) would print a tuple
    print('%d: %s' % (totalattachments, filename))
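
To see what `walk()` actually yields, here is a minimal sketch that builds a hypothetical message in code (the structure, text, and the `report.pdf` filename are made up for illustration) and lists the content types in traversal order. It suggests that `walk()` is a depth-first traversal with no marker for where a multipart container ends, so a variable like `lastseenboundary` would still read `multipart/alternative` when a sibling attachment under `multipart/mixed` comes up.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_sample():
    # Hypothetical layout:
    # multipart/mixed
    #   multipart/alternative
    #     text/plain
    #     text/html
    #   application/pdf (the user's attachment)
    outer = MIMEMultipart('mixed')
    alternative = MIMEMultipart('alternative')
    alternative.attach(MIMEText('hello', 'plain'))
    alternative.attach(MIMEText('<p>hello</p>', 'html'))
    outer.attach(alternative)
    pdf = MIMEApplication(b'%PDF-1.4 fake', 'pdf')
    pdf.add_header('Content-Disposition', 'attachment', filename='report.pdf')
    outer.attach(pdf)
    return outer


def walk_types(msg):
    # Content types in the order walk() visits them (containers included)
    return [part.get_content_type() for part in msg.walk()]


order = walk_types(build_sample())
print(order)
```

In this sample the PDF is listed right after `text/html`, even though it is a sibling of the `multipart/alternative` container rather than a child of it.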

1 Answer


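It seems that many (all?) emails with a multipart/alternative boundary contain text/plain, then text/html, and then file attachments directly after them. Are emails supposed to be structured this way? I don't think so, but there you have it.

So here is an updated version that treats every part inside the multipart/alternative boundary, after the text and HTML, as an attachment.

Email... ugh.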

import email

filename = 'xxx.eml'

with open(filename, 'rb') as f:
    msg = email.message_from_file(f)

    # count number of attachments in an email
    #
    totalattachments = 0
    firsttextattachmentseen = False
    lastseenboundary = ''
    # .walk steps through all the parts of an email including boundaries and attachments
    alternativetextplainfound = False
    alternativetexthtmlfound = False
    for part in msg.walk():
        if part.is_multipart():
            # this is a boundary, so we record what the last seen boundary was and continue to next part
            lastseenboundary = part.get_content_type()
            continue
        if lastseenboundary == 'multipart/alternative':
            #for HTML emails, the multipart/alternative part contains the HTML and its alternative text representation
            #BUT it seems that plenty of messages add file attachments after the txt and html, so we'll have to account for that
            if part.get_content_type() == 'text/plain' and alternativetextplainfound == False:
                alternativetextplainfound = True
                continue
            if part.get_content_type() == 'text/html' and alternativetexthtmlfound == False:
                alternativetexthtmlfound = True
                continue
        if (part.get_content_type() == 'text/plain') and (lastseenboundary != 'multipart/alternative'):
            #if this is a plain text email, then the first txt attachment is the message body so we do not 
            #count it as an attachment
            if firsttextattachmentseen == False:
                firsttextattachmentseen = True
                continue
            else:
                totalattachments += 1
                continue
        # any other part we encounter we shall assume is a user added attachment
        totalattachments += 1

    # format explicitly: under Python 2.7, print(a, b, c) would print a tuple
    print('%d: %s' % (totalattachments, filename))
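
As a possible alternative, and not the approach used above, one could count parts by their `Content-Disposition` header or filename instead of tracking boundaries. The sketch below is a heuristic under stated assumptions: `count_attachments` is a hypothetical helper, the `RAW` message is invented for the demonstration, and the rule would miss attachments sent without a disposition or filename while also counting inline images that carry a filename.

```python
import email

# Invented sample message: an alternative body plus two attached files
RAW = """\
From: a@example.com
To: b@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="OUTER"

--OUTER
Content-Type: multipart/alternative; boundary="INNER"

--INNER
Content-Type: text/plain

hello
--INNER
Content-Type: text/html

<p>hello</p>
--INNER--

--OUTER
Content-Type: text/plain
Content-Disposition: attachment; filename="notes.txt"

some notes
--OUTER
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="data.bin"
Content-Transfer-Encoding: base64

AAEC
--OUTER--
"""


def count_attachments(msg):
    """Count leaf parts that declare themselves attachments or carry a filename."""
    count = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold parts; they are never attachments themselves
        disposition = str(part.get('Content-Disposition', '')).strip().lower()
        if disposition.startswith('attachment') or part.get_filename():
            count += 1
    return count


total = count_attachments(email.message_from_string(RAW))
print(total)
```

The two body parts have neither a disposition nor a filename, so only `notes.txt` and `data.bin` are counted here, regardless of where they sit in the tree.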
Answered 2013-05-11T12:54:17.293