beautifulsoup - What exactly is a BS4 "element", how are elements counted, and which parser decides? Clearly confused

Question

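I'm now confused about something I thought I understood, but which it turns out I had simply been taking for granted.

One frequently runs into this type of for loop: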

from bs4 import BeautifulSoup as bs

mystring = 'some string'
soup = bs(mystring, 'html.parser')
for elem in soup.find_all():
    print(elem)  # placeholder: do something with elem

I hadn't paid much attention to what elem actually is until I ran into a version of this simplified string:

mystring = 'opening text<p>text one<BR> text two.<br></p>\
<p align="right">text three<br/> text four.</p><p class="myclass">text five. </p>\
<p>text six <span style="some style">text seven</span></p>\
<p>text 8. <span style="some other style">text nine</span></p>closing text'

I'm no longer sure what output I expected, but when I run this code:

counter = 1 #using 'normal' counting for simplification
for elem in soup.find_all():
    print('elem ',counter,elem)
    counter +=1

The output is:

elem  1 <p>text one<br/> text two.<br/></p>
elem  2 <br/>
elem  3 <br/>
elem  4 <p align="right">text three<br> text four.</br></p>
elem  5 <br> text four.</br>
elem  6 <p class="myclass">text five. </p>
elem  7 <p>text six <span style="some style">text seven</span></p>
elem  8 <span style="some style">text seven</span>
elem  9 <p>text 8. <span style="some other style">text nine</span></p>
elem  10 <span style="some other style">text nine</span>

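So bs4 + html.parser found 10 elements in the string. Which ones it picked and how it presents them seems unintuitive to me (for example, it skips opening text and closing text). Not only that, print(len(soup)) gives 7!

So, to be sure, I swapped html.parser for lxml and for html5lib. In both cases, not only is print(len(soup)) equal to 1, but the number of elems jumps to 13! And, naturally, the extra elements differ. From the 4th elem to the end, the output of both libraries matches that of html.parser. For the first three, however...

With html5lib you get: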

elem  1 <html><head></head><body>opening text<p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body></html>
elem  2 <head></head>
elem  3 <body>opening text<p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body>

With lxml, on the other hand, you get:

elem  1 <html><body><p>opening text</p><p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body></html>
elem  2 <body><p>opening text</p><p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body>
elem  3 <p>opening text</p>

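So what is the philosophy behind all this? Whose "fault" is it? Is there a "right" or "wrong" answer? And, practically speaking, should I religiously stick to one parser, or does each have its time and place?

Apologies for the length of the question.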

score 2 · Accepted Answer

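First, the root object, the soup variable in your example, is a BeautifulSoup object. You can think of it like the document object in a browser. In Beautiful Soup, the BeautifulSoup class is derived from Tag (the class that represents elements), but it is not really an "element" itself; it is more like the document.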

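When you call len on an element (or on the BeautifulSoup object), you get the number of nodes in that object's contents member. This can include comments, processing instructions, text nodes, element nodes, and so on.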
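To make that concrete, here is a minimal sketch using a shortened, made-up variant of the string from the question (the html variable and its content are my own illustration). It only shows that len counts direct children, text nodes included:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the string from the question.
html = 'opening text<p>text one</p><p>text two <span>text three</span></p>closing text'
soup = BeautifulSoup(html, 'html.parser')

# len() on a Tag or BeautifulSoup object counts its direct children only.
root_len = len(soup)
contents_len = len(soup.contents)
print(root_len, contents_len)

# The second <p> has two direct children: a text node and a <span>.
second_p = soup.find_all('p')[1]
print(len(second_p), second_p.contents)
```

The nested span does not add to len(soup), because it is a child of the second p rather than of the document.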

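A well-formed document should have a single root element, but comments and processing instructions are also fine at the root level. In your case, with no comments or processing instructions, I would normally expect a length of 1.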

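lxml and html5lib try to ensure that you have a well-formed document: if they see multiple root elements, they wrap them in html and body tags and give you a single root element. But, as mentioned, if your document already has a proper root html element and also has comments or processing instructions at the root level, the length may be > 1. Depending on the parser, they may also manipulate other content to conform to whatever rules they enforce when given odd, malformed HTML.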

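html.parser, on the other hand, is lenient. It doesn't try to correct what you are doing; it just parses things as they are. In your case, it returns an odd document with multiple text nodes at the root level and multiple <p> elements at the root level. So when you call len on soup, you get a value much greater than 1.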
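A rough way to see the difference yourself is to parse the string from the question with each parser and compare the two counts. This is only a sketch: lxml and html5lib are optional packages, so the loop skips any that are not installed, and the results dictionary is just a name I picked for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The string from the question, written with implicit concatenation.
mystring = (
    'opening text<p>text one<BR> text two.<br></p>'
    '<p align="right">text three<br/> text four.</p><p class="myclass">text five. </p>'
    '<p>text six <span style="some style">text seven</span></p>'
    '<p>text 8. <span style="some other style">text nine</span></p>closing text'
)

results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        soup = BeautifulSoup(mystring, parser)
    except FeatureNotFound:
        # lxml and html5lib are optional third-party packages.
        print(parser, 'is not installed, skipping')
        continue
    # (root-level nodes, total elements found by find_all())
    results[parser] = (len(soup), len(soup.find_all()))
    print(parser, results[parser])
```

With html.parser the pair should match the 7 and 10 reported in the question; the other two parsers add their html, head, or body wrappers, which is where the extra elements come from.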

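In general, the initial object returned by Beautiful Soup is a BeautifulSoup object. It can contain element (Tag) nodes or NavigableString nodes (text), and the latter come in various subtypes depending on whether they are comments, document declarations, CDATA, or other processing instructions. NavigableStrings (and their subtypes) are not element nodes, but they are usually found in the contents of an element or of the BeautifulSoup object.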
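This is also why find_all() skipped opening text and closing text in the question: called with no arguments it returns only Tag nodes. A small sketch, using a made-up string with a comment added so that a NavigableString subtype shows up:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical input: two root-level text nodes, a comment, and one <p>.
html = 'opening text<!-- a comment --><p>text six <span>text seven</span></p>closing text'
soup = BeautifulSoup(html, 'html.parser')

# Walk every node below the root and sort it by type.
tags = [n for n in soup.descendants if isinstance(n, Tag)]
strings = [n for n in soup.descendants if isinstance(n, NavigableString)]
comments = [n for n in strings if isinstance(n, Comment)]

print([t.name for t in tags])        # element nodes only
print(len(strings), len(comments))   # text-like nodes, comments among them

# find_all() with no arguments returns the same Tag nodes.
found = [t.name for t in soup.find_all()]
print(found)

# The root object is itself a Tag subclass, but acts as the document.
print(isinstance(soup, Tag), soup.name)
```

To visit the text nodes as well, iterate over soup.descendants (or soup.contents for direct children only) instead of calling find_all().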

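Which parser you want to use may depend on whether you care most about leniency, speed, HTML5 correctness, XML support, and so on. You may also occasionally want a different parser for a very specific use case.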
