python - Parsing Gmail emails with Python when the body contains unicode characters

Question

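I wrote a script that parses an email. It works fine for mail sent from the Mac OS X Mail client (the only client I have tested so far), but my parser fails when the body of the message contains unicode letters.

For example, I sent a message with the content ąčę.

Here is the part of my script that parses both the body and the attachments: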

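from email.feedparser import FeedParser

from django.core.files.base import ContentFile  # assumed source of ContentFile

# msg holds the raw message text
p = FeedParser()
p.feed(msg)
msg = p.close()
attachments = []
body = None
for part in msg.walk():
    # Skip container parts; only leaf parts carry content
    if part.get_content_type().startswith('multipart/'):
        continue
    try:
        filename = part.get_filename()
    except Exception:
        # unicode letters in filename, set default name then
        filename = 'Mail attachment'

    if part.get_content_type() == "text/plain" and not body:
        # Returns raw bytes, not yet decoded with the part's charset
        body = part.get_payload(decode=True)
    elif filename is not None:
        content_type = part.get_content_type()
        attachments.append(ContentFile(part.get_payload(decode=True), filename))

if body is None:
    body = ''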
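
The `except` branch above falls back to a default name when the filename contains unicode letters. As a rough alternative, here is a minimal sketch that tries to decode the name instead. It assumes Python 3 and that the filename arrives as an RFC 2047 encoded word, neither of which the post confirms; `decode_filename` is a hypothetical helper, not part of the original script.

```python
from email.header import Header, decode_header, make_header


def decode_filename(raw_name, default="Mail attachment"):
    """Hypothetical helper: decode an RFC 2047 encoded attachment name."""
    if raw_name is None:
        # No filename means the part is not an attachment
        return None
    try:
        # decode_header splits the value into (bytes, charset) chunks;
        # make_header joins them back into readable text
        return str(make_header(decode_header(raw_name)))
    except (LookupError, UnicodeError):
        # Unknown charset or undecodable bytes: use the default name
        return default


# Build an encoded filename the way a mail client might send it
encoded = Header("ąčę.txt", "utf-8").encode()
decoded = decode_filename(encoded)
```

In the loop this would be called as `decode_filename(part.get_filename())`, so plain ASCII names pass through unchanged.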

As I mentioned, it works for mail from OS X Mail, but not for mail from Gmail.

Traceback:

Traceback (most recent call last):
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/core/handlers/base.py", line 116, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/views/decorators/http.py", line 41, in inner
    return func(request, *args, **kwargs)
  File "/Users/aemdy/PycharmProjects/rezervavau/bms/messages/views.py", line 66, in accept
    Message.accept(request.POST.get('msg'))
  File "/Users/aemdy/PycharmProjects/rezervavau/bms/messages/models.py", line 261, in accept
    thread=thread
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/manager.py", line 149, in create
    return self.get_query_set().create(**kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/query.py", line 391, in create
    obj.save(force_insert=True, using=self.db)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/base.py", line 532, in save
    force_update=force_update, update_fields=update_fields)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/base.py", line 627, in save_base
    result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/manager.py", line 215, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/query.py", line 1633, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 920, in execute_sql
    cursor.execute(sql, params)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/backends/util.py", line 47, in execute
    sql = self.db.ops.last_executed_query(self.cursor, sql, params)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/operations.py", line 201, in last_executed_query
    return cursor.query.decode('utf-8')
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 115: invalid continuation byte
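
The last frame fails because bytes in some other charset are being decoded as UTF-8. A minimal sketch that reproduces the same kind of error follows. It is written for Python 3 and assumes the body was encoded as ISO-8859-13; that charset is only a guess consistent with the byte 0xe0 in the traceback, since the post does not say which charset Gmail used.

```python
# Assumption: the mail body was sent in ISO-8859-13 (a Baltic charset)
text = "ąčę"
raw = text.encode("iso-8859-13")  # stands in for get_payload(decode=True)

try:
    # This is what the database layer effectively attempts
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the charset the bytes were written in recovers the text
recovered = raw.decode("iso-8859-13")
```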

My script gives me the following body: ��. How can I decode it back to ąčę?

score 6 · Accepted Answer

OK, I found the solution myself. I will run some tests now and let you know if anything fails.

I needed to decode the body one more time:

body = part.get_payload(decode=True).decode(part.get_content_charset())
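
One caveat: `get_content_charset()` returns `None` when the part declares no charset, and `.decode(None)` would then raise. A hedged sketch with a fallback, assuming Python 3; `decode_body` and the UTF-8 default are illustrative choices, not part of the original answer.

```python
import email
from email.mime.text import MIMEText


def decode_body(part, default="utf-8"):
    """Hypothetical helper: return the part's payload as text."""
    raw = part.get_payload(decode=True)  # bytes, transfer encoding undone
    if raw is None:
        return ""
    # Fall back to the default when the part declares no charset
    charset = part.get_content_charset() or default
    try:
        return raw.decode(charset, "replace")
    except LookupError:
        # The declared charset is unknown to Python
        return raw.decode(default, "replace")


# Round-trip a message whose body is not UTF-8
sent = MIMEText("ąčę", "plain", "iso-8859-13")
received = email.message_from_string(sent.as_string())
body = decode_body(received)

# A part with no charset parameter at all
bare = email.message_from_string("Content-Type: text/plain\n\nhello")
bare_body = decode_body(bare)
```

The `"replace"` error handler trades exactness for robustness: undecodable bytes become U+FFFD instead of raising.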

score 1 · Accepted Answer

You might want to try using this:

from email.iterators import typed_subpart_iterator  # lowercase module name


def get_charset(message, default="ascii"):
    """Get the message charset"""

    if message.get_content_charset():
        return message.get_content_charset()

    if message.get_charset():
        # get_charset() returns a Charset object; str() gives its name
        return str(message.get_charset())

    return default

def get_body(message):
    """Get the body of the email message"""

    if message.is_multipart():
        #get the plain text version only
        text_parts = [part
                      for part in typed_subpart_iterator(message,
                                                         'text',
                                                         'plain')]
        body = []
        for part in text_parts:
            charset = get_charset(part, get_charset(message))
            body.append(unicode(part.get_payload(decode=True),
                                charset,
                                "replace"))

        return u"\n".join(body).strip()

    else: # if it is not multipart, the payload will be a string
          # representing the message body
        body = unicode(message.get_payload(decode=True),
                       get_charset(message),
                       "replace")
        return body.strip()
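
The code above targets Python 2, where `unicode` exists. For comparison, here is a rough Python 3 sketch of the same idea, not a drop-in replacement; `get_body_text` is a hypothetical name. It relies on `typed_subpart_iterator` walking the message itself as well as its subparts, so a single loop covers both the multipart and the non-multipart case.

```python
import email
from email.iterators import typed_subpart_iterator
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def get_body_text(message, default="ascii"):
    """Hypothetical helper: join all text/plain parts into one string."""
    chunks = []
    for part in typed_subpart_iterator(message, "text", "plain"):
        raw = part.get_payload(decode=True)  # bytes
        # Prefer the part's charset, then the message's, then the default
        charset = (part.get_content_charset()
                   or message.get_content_charset()
                   or default)
        chunks.append(raw.decode(charset, "replace"))
    return "\n".join(chunks).strip()


# Build a Gmail-style multipart/alternative message and parse it back
outer = MIMEMultipart("alternative")
outer.attach(MIMEText("ąčę", "plain", "utf-8"))
outer.attach(MIMEText("<p>ąčę</p>", "html", "utf-8"))
parsed = email.message_from_string(outer.as_string())
body = get_body_text(parsed)
```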

score 0 · Accepted Answer

You might want to take a look at email.iterators (I am not sure it will solve your encoding problem, though).

Answered 2012-11-18T23:14:34.063
