python - How do I get an email in UTF-8 format?

Question

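I'm writing a Python script to fetch the emails people send to my email address.

I'm using the ImapClient module and I do get the content of the emails, but strangely all my UTF-8 characters are encoded, like this:

No=C3=ABl

Here is a piece of my code: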

    email_message = email.message_from_bytes(message_data[b'RFC822'])
    print(email_message.get_payload(0))

I also tried adding the decode=True parameter to my get_payload call, but it returns a NoneType.

score 2 · Accepted Answer

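You must first identify the part of the email you are interested in. Then you decode that part's content according to that part's encoding. Each part may have a different encoding and/or charset. If you are interested in the body of the email, that is usually the first part, and it can be HTML or plain text depending on the program that sent it (some user agents, such as Gmail, include both forms).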

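You can use the email module's EmailMessage.walk() function on your message object to see the various parts and attachments and their respective content types. The parts are separated from each other by a special "boundary" string (usually random) that does not appear in the message body (to avoid ambiguity). It is easier to let the email module handle the parts for you, especially since parts can be nested.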
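As a minimal sketch of what walking the parts looks like, the snippet below parses a small made-up message (the `raw` bytes and the `BOUNDARY` string are assumptions for illustration, not taken from a real mailbox) and collects the content type of every part. Note that walk() yields the multipart container itself before its children.

```python
import email

# Hypothetical raw message, made up for illustration only.
raw = (
    b'Subject: Woah\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'S=C3=A9bastien est un pr=C3=A9nom.\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<p>S=C3=A9bastien est un pr=C3=A9nom.</p>\r\n'
    b'--BOUNDARY--\r\n'
)

msg = email.message_from_bytes(raw)

# walk() visits the container first, then each nested part in order.
content_types = [part.get_content_type() for part in msg.walk()]
for content_type in content_types:
    print(content_type)
```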

The text fragment you showed in your question appears to be quoted-printable encoded. You can find an example conversion from quoted-printable to UTF-8 here: Change "Quoted-printable" encoding to "utf-8"
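For instance, the fragment from the question can be decoded with the standard library's quopri module. This is only a small sketch, and it assumes the underlying bytes are UTF-8, which the `=C3=AB` sequence suggests.

```python
import quopri

# The quoted-printable fragment from the question, as bytes.
encoded = b'No=C3=ABl'

# Undo the quoted-printable transfer encoding: "=C3=AB" becomes two raw bytes.
decoded_bytes = quopri.decodestring(encoded)

# Interpret those bytes as UTF-8 (assumed charset) to get a str.
text = decoded_bytes.decode('utf-8')
print(text)
```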

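An example:

I've added a mock raw message below; it represents the bytes that make up an EmailMessage object. In an email, each part (body, attachments, etc.) can have a different content type, charset, and transfer encoding. Parts can embed sub-parts, but emails usually have a flat structure. For parts that are attachments, it is also common to look for the Content-Disposition value, which indicates a suggested filename for the file contents.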

    Subject: Woah
    From: "Sébastien" <seb@example.org>
    To: Bob <bob@example.org>
    Content-Type: multipart/alternative; boundary="000000000000690fec05765c6a66"

    --000000000000690fec05765c6a66
    Content-Type: text/plain; charset="UTF-8"
    Content-Transfer-Encoding: quoted-printable

    S=C3=A9bastien est un pr=C3=A9nom.

    --000000000000690fec05765c6a66
    Content-Type: text/html; charset="UTF-8"
    Content-Transfer-Encoding: quoted-printable

    <div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div di=
    r=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"lt=
    r"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div=
    dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr">...

    ...
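The sample above has no attachment, so here is a separate sketch of the Content-Disposition lookup mentioned earlier. It builds a hypothetical message with one attachment (the subject, body text, and `report.csv` filename are all made up) and then lists the parts that are marked as attachments, together with their suggested filenames.

```python
from email.message import EmailMessage

# Build a hypothetical message with a text body and one attachment.
msg = EmailMessage()
msg['Subject'] = 'Report'
msg.set_content('See attachment.')
msg.add_attachment(
    b'col1,col2\n1,2\n',
    maintype='application',
    subtype='octet-stream',
    filename='report.csv',  # stored in the Content-Disposition header
)

# Keep only the parts whose Content-Disposition is "attachment".
attachments = [
    (part.get_content_type(), part.get_filename())
    for part in msg.walk()
    if part.get_content_disposition() == 'attachment'
]
print(attachments)
```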

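Once you have selected the part of interest, you must use that part's encoding settings to convert the payload correctly. First undo any transfer encoding (e.g. quoted-printable), then decode the resulting bytes according to the part's charset.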
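A possible way to put those two steps together is sketched below. The `raw` message and the `first_text_part` helper are hypothetical names introduced for illustration. get_payload(decode=True) undoes the transfer encoding and returns bytes, and get_content_charset() reads the charset from the part's Content-Type header. The sketch also shows that calling get_payload(decode=True) on the multipart container returns None, which is probably what happened in the question.

```python
import email

# Hypothetical raw message, made up for illustration only.
raw = (
    b'Subject: Woah\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'S=C3=A9bastien est un pr=C3=A9nom.\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<p>S=C3=A9bastien est un pr=C3=A9nom.</p>\r\n'
    b'--BOUNDARY--\r\n'
)


def first_text_part(message, subtype='plain'):
    """Return the first text/<subtype> part of the message, or None."""
    for part in message.walk():
        if part.get_content_type() == f'text/{subtype}':
            return part
    return None


msg = email.message_from_bytes(raw)

# A multipart container has no payload of its own to decode.
container_payload = msg.get_payload(decode=True)  # None

part = first_text_part(msg)
# Step 1: undo the transfer encoding (quoted-printable here); gives bytes.
payload_bytes = part.get_payload(decode=True)
# Step 2: decode the bytes with the part's declared charset.
charset = part.get_content_charset() or 'utf-8'  # fallback is an assumption
text = payload_bytes.decode(charset, errors='replace')
print(text)
```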

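If the charset of the part you want is already UTF-8, all you have to do is undo the content transfer encoding (e.g. remove the quoted-printable sequences). But if the part's charset is different, say Latin-1, you will have to convert from bytes to unicode, and then from unicode back to UTF-8: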

    import quopri

    # mime_part_payload: the raw, still transfer-encoded bytes of the MIME part
    # remove quoted-printable encoding
    unquoted = quopri.decodestring(mime_part_payload)

    # latin-1 in this case is the charset from the MIME part's Content-Type header
    tmp_unicode = unquoted.decode('latin-1', errors='ignore')

    # encode to desired encoding
    u8 = tmp_unicode.encode('utf-8')
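As an alternative to decoding by hand, the newer email API can do both steps for you. This is a different technique from the manual one above, shown only as a sketch: parsing with policy.default yields an EmailMessage, whose get_body() picks the preferred body part and whose get_content() returns an already-decoded str. The `raw` Latin-1 message below is made up for illustration.

```python
import email
from email import policy

# Hypothetical raw message whose text part is Latin-1, made up for illustration.
raw = (
    b'Subject: Woah\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="ISO-8859-1"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'S=E9bastien est un pr=E9nom.\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="ISO-8859-1"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<p>S=E9bastien est un pr=E9nom.</p>\r\n'
    b'--BOUNDARY--\r\n'
)

# policy.default makes the parser return an EmailMessage instead of a Message.
msg = email.message_from_bytes(raw, policy=policy.default)

# Prefer the plain-text body, fall back to HTML if there is none.
body = msg.get_body(preferencelist=('plain', 'html'))

# get_content() undoes the transfer encoding and decodes with the part's charset.
text = body.get_content()

# Re-encode as UTF-8 bytes only if bytes are really needed.
u8 = text.encode('utf-8')
print(text)
```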
