python - UnicodeDammit: detwingle crashes on a website

Question

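I am scraping websites and parsing them with BeautifulSoup4. Since websites can come in truly random character sets, I use UnicodeDammit.detwingle to make sure I feed proper data to BeautifulSoup. It worked fine... until it crashed. One website makes the code break. The code that builds the "soup" looks like this: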

import bs4 as bs  # assumed alias; the original snippet only shows the name `bs`

u = bs.UnicodeDammit.detwingle(html_blob)  # <--- here it crashes
u = bs.UnicodeDammit(u.decode('utf-8'),
                     smart_quotes_to='html',
                     is_html=True)
u = u.unicode_markup
soup = bs.BeautifulSoup(u)

And the error (the standard Python Unicode hell duet):

  File ".../something.py", line 92, in load_bs_from_html_blob
    u = bs.UnicodeDammit.detwingle( html_blob )
  File ".../beautifulsoup4-4.1.3-py2.7.egg/bs4/dammit.py", line 802, in detwingle
    return b''.join(byte_chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

The offending website is this one.

Question: how do I decode website source properly and in a bulletproof way?

score 4 · Accepted Answer

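As far as character encoding goes, this website is not a special case at all: it is perfectly valid UTF-8, and even the HTTP header is set correctly. It follows that your code would crash on any website that is encoded in UTF-8 and contains code points beyond ASCII.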
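To illustrate the claim, here is a minimal Python 3 sketch. The helper names `is_valid_utf8` and `ascii_error` are made up for this example, and the sample is just the first non-ASCII character from the comments quoted further down, not the real page. It shows that valid UTF-8 containing a non-ASCII character produces the same message as the traceback as soon as anything decodes it as ASCII; it does not claim to reproduce the exact code path inside bs4.

```python
def is_valid_utf8(blob: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        blob.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def ascii_error(blob: bytes) -> str:
    """Return the error message from an ASCII decode, or '' if it succeeds."""
    try:
        blob.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return ""


# U+00B6 is encoded in UTF-8 as the two bytes 0xC2 0xB6.
sample = "\u00b6".encode("utf-8")

print(is_valid_utf8(sample))  # the bytes are valid UTF-8
print(ascii_error(sample))    # but they cannot be decoded as ASCII
```

The lead byte 0xc2 in the message is what UTF-8 uses for code points U+0080 through U+00BF, which is consistent with the page being ordinary UTF-8 rather than something exotic.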

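~~It also seems from the documentation that UnicodeDammit.detwingle takes a unicode string. You are passing it html_blob, and the variable name suggests that it is not a decoded unicode string.~~ (misunderstanding)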

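Handling any website's encoding is not trivial when the HTTP header or the markup lies about the encoding, or does not declare one at all. You need to apply various heuristics, and even then you will not always get it right. But this website sends the charset header correctly and is correctly encoded in that charset.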
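One possible shape for such a heuristic, as a rough Python 3 sketch using only the standard library: trust the charset in the Content-Type header first, then try UTF-8, then fall back to ISO-8859-1, which can decode any byte sequence (though possibly into garbage). The function names `charset_from_content_type` and `decode_html` are hypothetical, and the fallback order is an assumption, not a recommendation from BeautifulSoup.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_html(raw: bytes, content_type: Optional[str] = None) -> str:
    """Decode raw page bytes: declared charset, then UTF-8, then ISO-8859-1."""
    for encoding in (charset_from_content_type(content_type), "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or the declaration lied about the bytes.
            continue
    # Last resort: every byte maps to a character, so this cannot fail.
    return raw.decode("iso-8859-1")
```

For a page like the one in the question, where the header is correct, the first candidate already succeeds and no guessing is needed.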

Interesting trivia: the only text beyond ASCII on the website is in these JavaScript comments (after being decoded as UTF-8):

image = new Array(4); //¶¨ÒåimageÎªÍ¼Æ¬ÊýÁ¿µÄÊý×é 
image[0] = 'sample_BG_image01.png' //±³¾°Í¼ÏóµÄÂ·¾¶

If you then encode them as ISO-8859-1 and decode the result as GB2312, you get:

image = new Array(4); //定义image为图片数量的数组
image[0] = 'sample_BG_image01.png' //背景图象的路径

Which Google Translate (Chinese -> English) renders as:

image = new Array(4); //Defined image of the array of the number of images
image[0] = 'sample_BG_image01.png' //The path of the background image
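The ISO-8859-1 / GB2312 round trip described above can be reproduced in a few lines of Python 3. This is only a sketch: `to_mojibake` and `repair_mojibake` are illustrative names, and it starts from the clean Chinese comment rather than from the real page, so it demonstrates the mechanism rather than the site's actual bytes.

```python
def to_mojibake(text: str, real: str = "gb2312", wrong: str = "iso-8859-1") -> str:
    """Simulate the damage: bytes in the real encoding read with the wrong one."""
    return text.encode(real).decode(wrong)


def repair_mojibake(text: str, wrong: str = "iso-8859-1", real: str = "gb2312") -> str:
    """Undo the damage: recover the original bytes, then decode them correctly."""
    return text.encode(wrong).decode(real)


original = "定义image为图片数量的数组"
garbled = to_mojibake(original)      # what ended up in the page's comments
repaired = repair_mojibake(garbled)  # the round trip from the answer

print(garbled)
print(repaired)
```

This works because ISO-8859-1 maps every byte to exactly one character, so encoding the garbled text back to ISO-8859-1 recovers the original GB2312 bytes without loss.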
