python - Decoding shift-jis: "illegal multibyte sequence"

Question

I'm trying to decode a Shift-JIS encoded string, like this:

string.decode('shift-jis').encode('utf-8')

so that I can view it in my program.

When I hit two particular Shift-JIS characters, hex "0x87 0x54" and "0x87 0x55", I get this error:

UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 12-13: illegal multibyte sequence

But I'm sure they are valid Shift-JIS characters: http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml

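I also noticed that these characters show up as black boxes in my Shift-JIS text editor, which means it doesn't recognize them. So there is something special about these two characters that makes both my editor and the Python decoder fail. Any help?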

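(Sorry, I can't post a sample string: when these characters are present it won't copy to the clipboard from there, and it would also be converted to Unicode automatically. I've posted their hex values instead.)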

score 9 · Accepted Answer

There are multiple versions of Shift JIS. The shift_jis codec is JIS X 0208, whereas that table is JIS X 0213, which corresponds to the shift_jisx0213 codec.

>>> u'⑲⑳Ⅰ'.encode('shift_jisx0213')
'\x87R\x87S\x87T'
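To see the difference between the codecs on the two byte pairs from the question, here is a minimal Python 3 sketch. The helper name `try_decode` is hypothetical, and `cp932` is included only for comparison; it is not part of this answer.

```python
# The two byte pairs from the question: 0x87 0x54 and 0x87 0x55.
raw = bytes([0x87, 0x54, 0x87, 0x55])


def try_decode(data, codec_names):
    """Return {codec name: decoded text, or None if decoding fails}."""
    results = {}
    for name in codec_names:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The codec has no mapping for at least one byte sequence.
            results[name] = None
    return results


results = try_decode(raw, ["shift_jis", "shift_jisx0213", "cp932"])
for name, text in results.items():
    print(name, "->", repr(text))
```

With the standard CPython codecs, `shift_jis` should come back as `None` (the "illegal multibyte sequence" case), while the other two should decode both pairs to Roman numerals.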

score 1 · Accepted Answer

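You should never use shift_jisx0213. It has never been used for real production purposes, and Windows can't handle it. In most cases, the JIS X 0213 character set is used with Unicode, not with the Shift_JIS encoding.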

Use 'cp932' (in Python 3).

./sjis.txt contains

5c  7e  87  52  87  53  87  54  87  8a  fa  b1  fb  50  fb  fc

(These are \~⑲⑳Ⅰ㈱﨑瀨髙 as saved by Windows 10.)

>>> import codecs
>>> codecs.open('sjis.txt',"rb",'shift_jis').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py", line 700, in read
    return self.reader.read(size)
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0x87 in position 2: illegal multibyte sequence
>>> codecs.open('sjis.txt',"rb",'shift_jisx0213').read()
'¥‾⑲⑳Ⅰ㈱郫鍚騠'
>>> codecs.open('sjis.txt',"rb",'cp932').read()
'\\~⑲⑳Ⅰ㈱﨑瀨髙'

shift_jisx0213 decodes the two symbols incorrectly (\ and ~ come out as ¥ and ‾), as well as the last three kanji.
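Tying this back to the question's goal of re-encoding as UTF-8, a minimal Python 3 sketch might look like the following. It assumes the input really is Windows-style Shift_JIS, and it builds the bytes from the hex dump above rather than reading a file.

```python
# Bytes of sjis.txt, copied from the hex dump above.
raw = bytes.fromhex("5c 7e 87 52 87 53 87 54 87 8a fa b1 fb 50 fb fc")

# cp932 is Python's name for Windows code page 932,
# Microsoft's variant of Shift_JIS.
text = raw.decode("cp932")

# Re-encode as UTF-8, which is what the question was after.
utf8 = text.encode("utf-8")

print(text)
print(len(text), "characters,", len(utf8), "UTF-8 bytes")
```

If the source of the data is not known to be Windows, a different Shift_JIS variant may be the right choice, so treat `cp932` here as an assumption about the input rather than a universal answer.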
