python - BeautifulSoup HTMLParseError. What's wrong with this?

Question

Here is my code:

from bs4 import BeautifulSoup as BS
import urllib2
url = "http://services.runescape.com/m=news/recruit-a-friend-for-free-membership-and-xp"
res = urllib2.urlopen(url)
soup = BS(res.read())
other_content = soup.find_all('div',{'class':'Content'})[0]
print other_content

However, I get an error:

/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py:149: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
  "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
Traceback (most recent call last):
  File "web.py", line 5, in <module>
    soup = BS(res.read())
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py", line 150, in feed
    raise e

I've had two other people run this code and it works fine for them. Why doesn't it work for me? I have bs4 installed...

score 6 · Accepted Answer

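Judging by the error message, one thing you may need to do is install lxml, which gives BeautifulSoup a more robust parsing engine to work with. See the installing-a-parser section of the documentation (the URL is in the warning above) for a better overview. The likely reason it works for the other two people is that they have lxml (or another parser that handles this HTML correctly) installed, so BeautifulSoup uses that instead of the standard built-in parser. (Side note: your example also works for me on a system with lxml installed, but fails on one without it.)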

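Also, see this note in the documentation:

If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, you must install lxml or html5lib. Python's built-in HTML parser is just not very good in older versions.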
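The version thresholds in that note can be checked from the interpreter itself. This is a minimal sketch that assumes only the cutoffs quoted above; the helper name builtin_parser_is_reliable is made up for illustration and is not part of Beautiful Soup.

```python
import sys


def builtin_parser_is_reliable(version=None):
    """Return True if the given (major, minor, micro) version meets the
    thresholds quoted in the documentation note (2.7.3 / 3.2.2).

    Hypothetical helper for illustration only; not a Beautiful Soup API.
    """
    if version is None:
        version = sys.version_info
    v = tuple(version[:3])
    if v[0] == 2:
        return v >= (2, 7, 3)
    return v >= (3, 2, 2)


if __name__ == "__main__":
    # Report whether the running interpreter is past the quoted cutoff.
    print(builtin_parser_is_reliable())
```

Even when this returns True, the built-in parser can still reject a badly formed page, so an external parser may be needed regardless.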

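I'd suggest running sudo apt-get install python-lxml (on Debian/Ubuntu; on other systems, pip install lxml should achieve the same thing) and seeing whether the problem persists.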
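Once an external parser is installed, it can also be named explicitly instead of relying on whatever BeautifulSoup picks by default. The sketch below is written for Python 3 (the question uses Python 2) and assumes the standard parser names "lxml", "html5lib" and "html.parser". The helpers pick_parser and make_soup are hypothetical names, not part of bs4.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser name whose package is importable,
    falling back to Python's built-in "html.parser"."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


def make_soup(markup):
    """Parse markup with an explicitly chosen parser (hypothetical helper)."""
    # Imported lazily so pick_parser() can be used even without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, pick_parser())


if __name__ == "__main__":
    # Shows which parser would be used on this machine.
    print(pick_parser())
```

Passing the parser name explicitly makes the script behave the same way on every machine that has that parser installed, which would avoid the "works for them, fails for me" situation described in the question.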
