I'm decoding a large (roughly gigabyte-scale) flat-file database that haphazardly mixes character encodings. So far the Python module chardet has done a good job of identifying the encodings, but it hits a stumbling block...
In [428]: badish[-3]
Out[428]: '\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'
In [429]: chardet.detect(badish[-3])
Out[429]: {'confidence': 0.98999999999999999, 'encoding': 'Big5'}
In [430]: unicode(badish[-3], 'Big5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~/src/imdb/<ipython console> in <module>()
UnicodeDecodeError: 'big5' codec can't decode bytes in position 11-12: illegal multibyte sequence
chardet reports very high confidence in its choice of encoding, yet the string fails to decode with it... Are there other sensible approaches?
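One workaround sketch (an assumption on my part, not something from the post above): since chardet's guess is just a statistical guess, treat it as one candidate among several and fall back through a list of encodings that are plausible for the data, decoding permissively only as a last resort. The candidate list here (ISO-8859-9 for Turkish text like "Kuzey rüzgari", then Latin-1) is a hypothetical choice for this dataset; the helper name `decode_with_fallback` is made up for illustration.

```python
def decode_with_fallback(raw, candidates=("utf-8", "iso-8859-9", "latin-1")):
    """Try each candidate encoding in turn; decode permissively as a last resort.

    `raw` is a byte string; the first encoding that decodes it cleanly wins.
    """
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: map bytes one-to-one so the pipeline keeps moving.
    return raw.decode("latin-1", errors="replace")

# The problem line from the session above, as bytes:
line = b'\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'
print(decode_with_fallback(line))
```

Here UTF-8 fails on the 0xFC byte, so the helper falls through to ISO-8859-9, which decodes every byte (correctly or not). In practice you would prepend chardet's guess to `candidates` when its confidence is high, and log lines that reach the permissive fallback for manual inspection.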