python - How do I handle incorrectly encoded characters in Python 2?

Question

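The HTML file I am fetching contains some characters that are not supported by the encoding declared in its HTML header:

I found that the following characters are not supported by the Shift_JIS encoding but are actually used in the page. My browser displays them correctly.

When I read this HTML file and decode it for processing, I get a UnicodeDecodeError.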

import urllib2

url = 'http://matsucon.net/material/dic/kao09.html'
response = urllib2.urlopen(url)
# Raises UnicodeDecodeError when a byte sequence is not valid shift_jis_2004
response.read().decode('shift_jis_2004')

Is there a good way to process HTML that contains incorrectly encoded characters without raising an error?

score 1 · Accepted Answer


Try this:

response.read().decode('shift_jis_2004',errors='ignore')
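
To show what the `errors` argument changes, here is a minimal sketch that avoids the network. The byte string `RAW` and the helper `decode_lenient` are made up for illustration and are not from the page in the question: `RAW` is two valid Shift_JIS characters with one stray `0xFF` byte between them. The error handler is passed positionally because keyword arguments to `str.decode` are only accepted from Python 2.7 onward.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: HIRAGANA A (0x82 0xA0), a stray 0xFF byte that is not
# valid in shift_jis_2004, then HIRAGANA I (0x82 0xA2).
RAW = b'\x82\xa0\xff\x82\xa2'


def decode_lenient(data, encoding='shift_jis_2004', errors='replace'):
    """Decode bytes, handling undecodable input according to `errors`.

    'ignore' silently drops the bad bytes; 'replace' substitutes U+FFFD so
    that the loss stays visible in the output.
    """
    # Positional arguments also work on Python 2 releases older than 2.7.
    return data.decode(encoding, errors)


# The default handler ('strict') raises, as in the question.
try:
    RAW.decode('shift_jis_2004')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = decode_lenient(RAW, errors='ignore')    # bad byte dropped
replaced = decode_lenient(RAW, errors='replace')  # bad byte becomes U+FFFD
```

Both handlers lose the offending characters. If the page was in fact written with Microsoft's Shift_JIS variant, which is only a guess about this site, decoding with Python's `'cp932'` codec might recover some of them instead of discarding them.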

Answered 2014-11-27T09:40:07.873
