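I have a data source that I do not control. It sends strings in varying encodings, and I cannot know the encoding in advance! I need a way to decode them correctly and store them in a format I understand and control, say UTF-8.
For example: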
- “J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s...je ne sais pas”
should read
- “J'ai déjà un problème, après...je ne sais pas”
What I tried:
> stringToTest="J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# str has no decode() method, but one can try an encode/decode round trip
> stringToTest.encode().decode()
"J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# which does not help :)
By trial and error, I found that the encoding is 'iso-8859-1':
> stringToTest.encode('iso-8859-1').decode()
"J'ai déjà un problème, après... je ne sais pas"
What I want/need is to find 'iso-8859-1' automatically!
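The trial-and-error step above can be automated in a rough way. The following is only a minimal sketch, not a general encoding detector: the helper name `find_mojibake_codec` and the candidate list are assumptions made for illustration, and the sample is the fully mis-decoded form of the string that the unidecode output further down implies. It re-encodes the str with each candidate codec and keeps the first one whose bytes are valid UTF-8.

```python
def find_mojibake_codec(text, candidates=("iso-8859-1", "cp1252")):
    """Return (codec, repaired_text) for the first candidate codec whose
    bytes decode cleanly as UTF-8 and change the text, else None.

    Hypothetical helper: the candidate list is an assumption, not exhaustive.
    """
    for codec in candidates:
        try:
            repaired = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the string
        if repaired != text:
            return codec, repaired
    return None  # nothing to repair, or no candidate fits


sample = "J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
print(find_mojibake_codec(sample))
```

Pure ASCII input yields `None`, because the round trip leaves it unchanged. Several single-byte codecs can succeed on the same input, so the order of the candidates decides which name is reported.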
I tried using chardet!
> import chardet
> chardet.detect(stringToTest)
Traceback (most recent call last):
File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
'{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
But... since it is a string... chardet does not accept it! And, I am ashamed to admit, I did not manage to convert the string into something chardet accepts!
> test1=b"J'ai déjà un problème, après... je ne sais pas"
File "<input>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
# Ok str and unicode are similar things... but who knows?!?!
> test1=u"J'ai déjà un problème, après... je ne sais pas"
> chardet.detect(test1)
Traceback (most recent call last):
File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
'{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
# Nope
> bytes(stringToTest)
Traceback (most recent call last):
File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
TypeError: string argument without an encoding
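For reference, the last error arises because `bytes()` needs an explicit codec when it is given a str. A minimal sketch of the conversion follows, assuming the fully mis-decoded sample string and assuming 'iso-8859-1' as the codec (which is exactly the piece of information being searched for, so this only shows the mechanics). The chardet call is optional and is skipped if the package is not installed.

```python
string_to_test = "J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"

# bytes() on a str requires a codec; this is equivalent to str.encode().
raw = bytes(string_to_test, "iso-8859-1")
assert raw == string_to_test.encode("iso-8859-1")

try:
    import chardet  # third-party package; may not be installed
except ImportError:
    chardet = None

if chardet is not None:
    # detect() accepts bytes or bytearray and returns a dict of guesses
    print(chardet.detect(raw))

# The recovered bytes are valid UTF-8 for this sample.
print(raw.decode("utf-8"))
```

Note that chardet inspects the bytes it is given, so what it reports describes `raw`, not the codec that was used to produce `raw` from the str.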
Why not unidecode?!?
> from unidecode import unidecode
> unidecode(stringToTest)
'J\'ai dA(c)jA un problA"me, aprA"s... je ne sais pas'