python - How to handle a Python 3.x UnicodeDecodeError in the email package?

Question

I am trying to read an email from a file, like this:

import email
with open("xxx.eml") as f:
   msg = email.message_from_file(f)

I get this error:

Traceback (most recent call last):
  File "I:\fakt\real\maildecode.py", line 53, in <module>
    main()
  File "I:\fakt\real\maildecode.py", line 50, in main
    decode_file(infile, outfile)
  File "I:\fakt\real\maildecode.py", line 30, in decode_file
    msg = email.message_from_file(f)  #, policy=mypol
  File "C:\Python33\lib\email\__init__.py", line 56, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "C:\Python33\lib\email\parser.py", line 55, in parse
    data = fp.read(8192)
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1920: character maps to <undefined>

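The file contains a multipart email in which the part in question is encoded in UTF-8. The file's content or encoding may be corrupted, but I have to process it anyway.

How can I read the file even if it contains Unicode errors? I can't find the policy object compat32, and there seems to be no way to handle the exception and let Python continue from the point where it occurred.

What can I do?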

score 5 · Accepted Answer

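To parse an email message in Python 3 without Unicode errors, open the file in binary mode and parse its contents with email.message_from_binary_file(f) (or email.message_from_bytes(f.read())); see the documentation of the email.parser module.

The following code parses a message in a way that is compatible with both Python 2 and 3: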

import email
with open("xxx.eml", "rb") as f:
    try:
        msg = email.message_from_binary_file(f)  # Python 3
    except AttributeError:
        msg = email.message_from_file(f)  # Python 2

(Tested with Python 2.7.13 and Python 3.6.0.)
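Once the message is parsed from bytes, each text part can be decoded with the charset it declares. The following is only a minimal sketch for Python 3: the helper names parse_eml_bytes and text_parts are hypothetical, and both the use of email.policy.default and the UTF-8 fallback for parts without a declared charset are assumptions, not part of the answer above.

```python
import email
from email import policy


def parse_eml_bytes(raw):
    """Parse raw message bytes; no text decoding happens up front."""
    # policy.default is an assumption here; the compat32 default also works.
    return email.message_from_bytes(raw, policy=policy.default)


def text_parts(msg):
    """Yield (charset, text) for each text/* part.

    Bytes that are invalid in the declared charset become U+FFFD
    instead of raising UnicodeDecodeError.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # Fall back to UTF-8 when the part declares no charset (assumption).
        charset = part.get_content_charset() or "utf-8"
        payload = part.get_payload(decode=True) or b""
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name in the header: fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        yield charset, text
```

To use it on a file, read the bytes with open("xxx.eml", "rb") and pass f.read() to parse_eml_bytes.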

score 4 · Accepted Answer

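I can't test with your message, so I don't know whether this will actually work, but you can do the string decoding yourself:

import email
with open("xxx.eml", encoding='utf-8', errors='replace') as f:
    text = f.read()
    msg = email.message_from_string(text)  # pass the decoded text, not the file object

If the message isn't actually UTF-8, this will give you a lot of replacement characters. But if it has \x81 in it, my guess is UTF-8.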
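To illustrate why the traceback in the question mentions cp1252 and byte 0x81, here is a small sketch. The sample bytes and the helper name try_decode are made up for illustration. Byte 0x81 has no mapping in Python's cp1252 codec, but it is a valid continuation byte in UTF-8.

```python
# "Łódź" encoded as UTF-8; the second byte is 0x81.
data = b"\xc5\x81\xc3\xb3d\xc5\xba"


def try_decode(raw, encoding, errors="strict"):
    """Return the decoded text, or a short failure note instead of raising."""
    try:
        return raw.decode(encoding, errors=errors)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


print(try_decode(data, "cp1252"))               # fails on byte 0x81
print(try_decode(data, "utf-8"))                # decodes cleanly
print(try_decode(b"ok \xff", "utf-8", "replace"))  # invalid byte becomes U+FFFD
```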

score 0 · Accepted Answer

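# Decode the raw bytes as ASCII, turning every non-ASCII byte into a \xNN escape.
# (backslashreplace works for decoding in Python 3.5 and later.)
with open('email.txt', 'rb') as f:
    ascii_text = f.read().decode('ascii', 'backslashreplace')

# Note: this overwrites the original file. newline='' keeps line endings unchanged.
with open('email.txt', 'w', encoding='ascii', newline='') as f:
    f.write(ascii_text)

# now do your processing stuff

I doubt this is the best way to handle it... but it is at least one way...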
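The same idea can be applied in memory, without rewriting the file. This is a rough sketch assuming Python 3.5 or later; the function name parse_with_escapes is hypothetical. Non-ASCII bytes survive only as literal \xNN escapes, so the original text is not recovered.

```python
import email


def parse_with_escapes(raw):
    """Parse message bytes after escaping every non-ASCII byte as \\xNN."""
    text = raw.decode("ascii", errors="backslashreplace")
    return email.message_from_string(text)
```

A caller would read the file with open('email.txt', 'rb') and pass f.read() to parse_with_escapes.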

score 0 · Accepted Answer

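An approach that works in Python 3: it finds the declared encoding and then reloads the file with that encoding.

import email
with open('file.eml', errors='replace') as f:
    msg = email.message_from_file(f)
codes = [x for x in msg.get_charsets() if x is not None]
if codes:
    with open('file.eml', encoding=codes[0]) as f:
        msg = email.message_from_file(f)

I tried msg.get_charset(), but it sometimes returns None even when another encoding is available, so detecting the encoding is slightly more involved.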
