python - BeautifulSoup doesn't find correctly parsed elements

Question

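I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents, and I stumbled upon something very strange.

The HTML contains multiple errors, such as multiple <html></html> pairs, a <title> outside the <head>, and so on.

However, html5lib usually works well even in these cases. In fact, when I do: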

soup = BeautifulSoup(document, "html5lib")

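and pretty-print soup, I see the following output: http://pastebin.com/8BKapx88

which contains a lot of <a> tags.

However, when I call soup.find_all("a") I get an empty list. With lxml I get the same result.

So: has anyone stumbled on this problem before? What is going on? How do I get the links that html5lib found but that find_all does not return?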

score 4 · Accepted Answer

Even though the correct answer is "use another parser" (thanks @alecxe), I have another workaround. For some reason, this also works:

soup = BeautifulSoup(document, "html5lib")
# Re-parse the prettified output of the first pass
soup = BeautifulSoup(soup.prettify(), "html5lib")
print(soup.find_all('a'))

It returns the same list of links as:

soup = BeautifulSoup(document, "html.parser")
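
The two-pass workaround above could be wrapped in a small helper. This is only a sketch: the name find_all_with_reparse and its parameters are made up for illustration, and it does not explain why the second pass helps. It simply re-parses the prettified tree when the first pass finds nothing.

```python
from bs4 import BeautifulSoup


def find_all_with_reparse(markup, tag="a", parser="html5lib"):
    """Return all `tag` elements; if none are found, re-parse the prettified tree once."""
    soup = BeautifulSoup(markup, parser)
    found = soup.find_all(tag)
    if not found:
        # Second pass: serialize the parsed tree and feed it back to the same parser.
        soup = BeautifulSoup(soup.prettify(), parser)
        found = soup.find_all(tag)
    return found


# html.parser ships with Python, so this call runs without extra installs;
# pass parser="html5lib" to mirror the workaround from this answer.
links = find_all_with_reparse(
    '<p><a href="/a">A</a> <a href="/b">B</a></p>', parser="html.parser"
)
print([a.get("href") for a in links])
```

On well-formed input the first pass already succeeds, so the second parse is only paid for when the result is empty.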

score 3 · Accepted Answer

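When parsing malformed, tricky HTML, the choice of parser is very important:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.

html.parser worked for me: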

from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
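
The comparison in the demo can be automated along these lines. This is a minimal sketch, not part of Beautiful Soup: count_tags_by_parser is a hypothetical helper that parses the same markup with each parser and counts the matching tags, recording None for parsers that are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tags_by_parser(markup, tag="a", parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of `tag` elements it yields (None if unavailable)."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is not installed.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all(tag))
    return counts


sample = '<html><title>t</title><body><a href="/a">A</a><a href="/b">B</a></body></html>'
for name, count in count_tags_by_parser(sample).items():
    print(name, "not installed" if count is None else count)
```

If the counts disagree for a given document, that is a sign the markup is broken in a way the parsers recover from differently, and it may be worth picking the parser whose tree matches what you expect.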

See also:

Differences between parsers.
