As part of some email batch processing, we need to decode and clean up the messages. One critical part of that process is separating the bodies of a message from its attachments. The trickiest part is determining whether a Content-Disposition: inline part should be considered a message body alternative or a file.

So far, this code seems to handle most of the cases:

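from email import message_from_string

def split_parts(raw):
    """Split a raw message into (bodies, files), both lists of leaf parts."""
    msg = message_from_string(raw)
    bodies = []
    files = []

    for sub in msg.walk():
        # Containers carry no payload of their own; walk() visits their children.
        if sub.is_multipart():
            continue
        # Disposition types are case-insensitive, so normalise before comparing.
        cd = sub.get("Content-Disposition", "").strip().lower()
        # Explicit attachments, and inline parts that carry a filename, are files.
        if cd.startswith("attachment") or (cd.startswith("inline") and
                                           sub.get_filename()):
            files.append(sub)
        else:
            bodies.append(sub)

    return bodies, files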

Note the reliance on inline parts having a filename specified in their headers, which Outlook seems to do for all its multipart/related messages. The Content-ID could also be used as a hint, but according to RFC 2387 it is not such an indicator.

Therefore, if an embedded image is encoded as a message part that has Content-Disposition: inline, defines a Content-ID, and doesn't have a filename, the above code can mistakenly classify it as a message body alternative.
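
To make that failure mode concrete, here is a minimal sketch that reproduces it. The sample message, the `build_sample` helper and the `logo@example.invalid` Content-ID are made up for illustration; `split_parts` repeats the classification rule from the code above so the snippet runs on its own.

```python
from email import message_from_string
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def split_parts(raw):
    """Same classification rule as the code in the question."""
    msg = message_from_string(raw)
    bodies, files = [], []
    for sub in msg.walk():
        if sub.is_multipart():
            continue
        cd = sub.get("Content-Disposition", "")
        if cd.startswith("attachment") or (cd.startswith("inline")
                                           and sub.get_filename()):
            files.append(sub)
        else:
            bodies.append(sub)
    return bodies, files


def build_sample():
    """Hypothetical multipart/related message with one embedded image."""
    root = MIMEMultipart("related")
    root.attach(MIMEText('<p><img src="cid:logo@example.invalid"></p>', "html"))
    # Placeholder bytes; the subtype is passed explicitly so nothing is sniffed.
    image = MIMEImage(b"\x89PNG\r\n\x1a\n", _subtype="png")
    image.add_header("Content-Disposition", "inline")  # no filename parameter
    image.add_header("Content-ID", "<logo@example.invalid>")
    root.attach(image)
    return root.as_string()


bodies, files = split_parts(build_sample())
body_types = [part.get_content_type() for part in bodies]
print(body_types)  # the image/png part is listed among the bodies
print(files)       # nothing was classified as a file
```

In this constructed case the image ends up in `bodies` because it has neither an `attachment` disposition nor a filename, which is exactly the combination described above.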

From what I've read in the RFCs, there's not much hope of finding an easy check (especially since coding according to the RFCs is almost useless in the real world, because nobody does it); but I was wondering how big the chances are of hitting the misclassification case.


Rationale

I could have a set of functions to treat each multipart/* case and let them indirectly recurse. However, we don't care so much about a faithful display; as a matter of fact, we filter all HTML messages through tidy. Instead, we are more interested in choosing one of the message body alternatives and saving as many attachments as possible, even if they are intended to be embedded.
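
As a rough sketch of what "choosing one of the message body alternatives" could look like once the parts are split: the `pick_body` helper and its default preference order (plain text first, then HTML) are assumptions for illustration, not something stated above.

```python
from email.message import EmailMessage


def pick_body(bodies, preferred=("text/plain", "text/html")):
    """Return the first part matching the preference order, or None.

    `preferred` is an assumed ordering; swap it to favour HTML instead.
    """
    for wanted in preferred:
        for part in bodies:
            if part.get_content_type() == wanted:
                return part
    return None


# Made-up two-alternative message to exercise the helper.
msg = EmailMessage()
msg.set_content("Hello")
msg.add_alternative("<p>Hello</p>", subtype="html")

bodies = [part for part in msg.walk() if not part.is_multipart()]
chosen = pick_body(bodies)
print(chosen.get_content_type())
```

Keeping the preference order as a parameter makes it easy to favour HTML when the plain-text alternative turns out to be an empty stub.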

Furthermore, some user agents do really weird things when composing multipart/alternative messages with embedded attachments that are not intended to be displayed inline (such as PDF files), as a result of the user dragging and dropping an arbitrary file into the composition window.

1 Answer

I'm not quite following you, but if you want bodies, I would assume anything with a text/plain or text/html content type, with an inline content disposition, and with no file name or no Content-ID, could be a body part.
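
A minimal sketch of that heuristic, reading "no file name or no Content-ID" literally. The `could_be_body` name is made up, and treating a missing Content-Disposition the same as `inline` is an extra assumption. It relies on `Message.get_content_disposition()`, which is available from Python 3.5.

```python
from email import message_from_string

BODY_TYPES = {"text/plain", "text/html"}


def could_be_body(part):
    """Heuristic only: text part, inline (or no) disposition, no name or no id."""
    if part.is_multipart() or part.get_content_type() not in BODY_TYPES:
        return False
    # Lower-cased disposition type without parameters, or None if the header is absent.
    disposition = part.get_content_disposition()
    if disposition not in (None, "inline"):  # assumption: absent counts as inline
        return False
    return part.get_filename() is None or part.get("Content-ID") is None


# Made-up message: one inline text body and one text attachment.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain
Content-Disposition: inline

Hello
--XX
Content-Type: text/plain
Content-Disposition: attachment; filename="notes.txt"

Some notes
--XX--
"""

msg = message_from_string(RAW)
leaves = [part for part in msg.walk() if not part.is_multipart()]
flags = [could_be_body(part) for part in leaves]
print(flags)
```

Restricting the check to text types would also keep an inline image with a Content-ID and no filename out of the bodies, at the cost of rejecting any body sent with an unusual content type.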

Answered 2012-10-02T17:53:48.243