python - Parsing a specific website crashes the Python process

Question

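I want to parse HTML pages for images (from http://www.z-img.com), but when I load the page into BeautifulSoup (bs4), Python crashes. The "Problem details" in the crash dialog name etree.pyd as the "Fault Module Name", which suggests a parsing error, but so far I haven't been able to pin down the cause.

This is the simplest code I could boil it down to, on Python 2.7: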

import requests, bs4

url = r"http://z-img.com/search.php?&ssg=off&size=large&q=test"
r = requests.get(url)
html = r.content
#or 
#import urllib2
#html = urllib2.urlopen(url).read()
soup  = bs4.BeautifulSoup(html)

Sample output, after I ran it through JsBeautifier.com, is on PasteBin (http://pastebin.com/XYT9g4Lb).

score 1 · Accepted Answer

This is a bug that was fixed in lxml 2.3.5. Upgrade to 2.3.5 or later.

Answered 2013-04-28T04:37:24.100
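
To tell whether an installed lxml predates the fix mentioned above, the version can be checked from Python. The following is a minimal sketch: the helper name has_doctype_fix is hypothetical, and the 2.3.5 threshold is taken from this answer rather than verified independently. etree.LXML_VERSION is the version tuple that lxml exposes.

```python
def has_doctype_fix(version_tuple, minimum=(2, 3, 5)):
    """Return True if an lxml version tuple is at least `minimum`.

    Hypothetical helper; the 2.3.5 threshold comes from the answer above.
    """
    return tuple(version_tuple[:3]) >= minimum


try:
    from lxml import etree
except ImportError:
    print("lxml is not installed")
else:
    # LXML_VERSION is a tuple such as (2, 3, 5, 0)
    print(etree.LXML_VERSION, has_doctype_fix(etree.LXML_VERSION))
```

If the check returns False, upgrading lxml (for example with pip install --upgrade lxml) should bring in the fix.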

score 0 · Accepted Answer

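Wouldn't you know it, the first thing I do after submitting the question is find the solution: the <!DOCTYPE> tag appears to be the root of it. I created a new HTML file, temp.html: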

<!DOCTYPE>
<html>
</html>

and passed it to BeautifulSoup as an HTML string, which was enough to crash Python again. So from now on I just need to remove that tag before passing the HTML to BeautifulSoup:

import requests, bs4

url = r"http://z-img.com/search.php?&ssg=off&size=large&q=test"
r = requests.get(url)
html = r.content
#or 
#import urllib2
#html = urllib2.urlopen(url).read()

#replace the declaration with nothing, and my problems are solved
html = html.replace(r"<!DOCTYPE>", "")
soup  = bs4.BeautifulSoup(html)

Hope this saves someone else some time.
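
The exact-string replace above only matches the literal text <!DOCTYPE>. A slightly more tolerant variant, sketched below, also catches different letter case and stray whitespace while leaving a well-formed declaration such as <!DOCTYPE html> untouched. The helper name strip_empty_doctype is hypothetical, and the sketch assumes the HTML arrives as bytes (as r.content does).

```python
import re

# Matches a DOCTYPE declaration with no name, e.g. b"<!DOCTYPE>" or b"<!doctype  >"
EMPTY_DOCTYPE = re.compile(br"<!DOCTYPE\s*>", re.IGNORECASE)


def strip_empty_doctype(html):
    """Remove empty DOCTYPE declarations from raw HTML bytes (hypothetical helper)."""
    return EMPTY_DOCTYPE.sub(b"", html)


sample = b"<!DOCTYPE>\n<html>\n</html>"
cleaned = strip_empty_doctype(sample)
print(cleaned)
```

The cleaned bytes can then be handed to bs4.BeautifulSoup as before. Another option, assuming the crash is confined to lxml, would be to pass "html.parser" as the second argument to bs4.BeautifulSoup so that the standard-library parser is used instead.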
