python - Decoding an unknown string

Question

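I have a data source that I do not control. It sends strings in different encodings, and I have no way of knowing the encoding in advance! I need to be able to decode them correctly and store them in a format I understand and control, say UTF-8.

For example: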

“J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s...je ne sais pas”

should read

“J'ai déjà un problème, après...je ne sais pas”

What I tried:

> stringToTest="J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# there is no decode for string, directly, but one can try
> stringToTest.encode().decode()
"J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# which does not help :)

By trial and error, I found that the encoding is "iso-8859-1":

> stringToTest.encode('iso-8859-1').decode()
"J'ai déjà un problème, après... je ne sais pas"

What I want/need is to find "iso-8859-1" automatically!

I tried using chardet!

> import chardet

> chardet.detect(stringToTest)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>

But... since it is a string... chardet does not accept it! And, I am ashamed to admit, I did not manage to convert the string into something chardet accepts!

> test1=b"J'ai dÃ©jÃ un problÃ¨me, aprÃ¨s... je ne sais pas"
  File "<input>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

# Ok str and unicode are similar things... but who knows?!?!
> test1=u"J'ai dÃ©jÃ un problÃ¨me, aprÃ¨s... je ne sais pas"
> chardet.detect(test1)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>

# NOP
> bytes(stringToTest)
Traceback (most recent call last):
  File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
TypeError: string argument without an encoding

Why not unidecode?!?

from unidecode import unidecode
unidecode(stringToTest)
'J\'ai dA(c)jA un problA"me, aprA"s... je ne sais pas'

score 1 · Accepted Answer

The string

"J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"

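is an example of mojibake: encoded text (bytes) that has been decoded with the wrong encoding. In this particular case, the string was originally encoded as UTF-8 and then wrongly decoded as ISO-8859-1 (latin-1). Re-encoding it as ISO-8859-1 recreates the original UTF-8 bytes, and decoding those bytes as UTF-8 (the default in Python 3) produces the expected result.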
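The round trip described above can be reproduced with the standard library alone. This is a minimal sketch; the variable names are illustrative and not part of the question.

```python
# The text as the sender intended it.
original = "J'ai déjà un problème, après... je ne sais pas"

# The sender encodes it as UTF-8 ...
utf8_bytes = original.encode("utf-8")

# ... and somewhere along the way the bytes are wrongly decoded as ISO-8859-1.
# Every byte maps to exactly one character, so nothing is lost, only misread.
mojibake = utf8_bytes.decode("iso-8859-1")

# Reversing the wrong step gives back the original bytes,
# which can then be decoded with the right codec.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

print(repr(mojibake))
print(repaired)
```

Because ISO-8859-1 maps each of the 256 byte values to one code point, the wrong decode is lossless, which is why the repair works here.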

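If you are receiving these mojibake strings from an external source, you can safely encode them with ISO-8859-1 to recreate the original bytes, provided every character in the string falls within the Latin-1 range (U+0000 to U+00FF). The bytes (the encoded text) can then be passed to chardet for analysis.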
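A hedged sketch of that approach follows. The helper names `recover_bytes` and `guess_encoding` are hypothetical, and the code assumes the incoming strings were mis-decoded as ISO-8859-1, which may not hold for every source. `chardet.detect` returns a dictionary whose `"encoding"` entry can be `None`, and its guess is a statistical estimate rather than a guarantee, especially for short inputs.

```python
from typing import Optional


def recover_bytes(text: str) -> Optional[bytes]:
    """Recreate the original bytes, assuming text was mis-decoded as ISO-8859-1.

    Returns None when the string contains characters outside Latin-1,
    in which case this assumption cannot be right.
    """
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None


def guess_encoding(raw: bytes) -> Optional[str]:
    """Ask chardet for its best guess; returns None if chardet is unavailable."""
    try:
        import chardet  # third-party package: pip install chardet
    except ImportError:
        return None
    return chardet.detect(raw)["encoding"]


string_to_test = "J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
raw = recover_bytes(string_to_test)
if raw is not None:
    # The guess may be wrong or None, so treat it as a hint only.
    print(guess_encoding(raw))
```

Since detection is only a guess, it seems prudent to wrap the final `raw.decode(...)` in a `try`/`except UnicodeDecodeError` and fall back to a known default when it fails.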
