python - Split a BeautifulSoup into 2 soup trees

Question

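There are several ways to split a BeautifulSoup parse tree, either by getting a list of elements or by getting the tag strings. But there seems to be no way to keep the tree intact while splitting.

I want to split the following snippet (soup) on the <br />s. That is trivial with strings, but I want to keep the structure: I want a list of parse trees.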

s="""<p>
foo<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar
</p>"""
soup=BeautifulSoup(s)

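Obviously, I could do [BeautifulSoup(i) for i in str(soup).split('<br />')], but that is ugly, and I have too many links for that.

It is possible to iterate over soup.findAll('br') using soup.next and soup.previousSibling(), but that does not return a parse tree, only the elements it contains.

Is there a way to extract a complete subtree of tags from a BeautifulSoup tag while preserving all parent and sibling relationships?

Edit, to be clearer:

The result should be a list of BeautifulSoup objects that I can traverse further, via output[0].a, output[1].text and so on. Splitting the soup on the <br />s would return a list of all links for further processing, which is what I need: every link in the snippet above, with its text, its attributes, and the "bar" that follows it, which is the description of that link.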

score 0 · Accepted Answer

If you don't mind the original tree being altered, I'd use .extract() on the <br/> tags to simply remove them from the tree:

>>> for br in soup.find_all('br'): br.extract()
... 
<br/>
<br/>
<br/>
<br/>
>>> soup
<html><body><p>
foo
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
</p></body></html>

This is still a complete, working tree:

>>> soup.p
<p>
foo
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
</p>
>>> soup.p.a
<a href="http://...html" target="_blank">foo</a>

But you don't need to remove those tags at all to achieve what you want:

for link in soup.find_all('a'):
    print(link['href'], ''.join(link.stripped_strings), link.next_sibling)

The result is:

>>> for link in soup.find_all('a'):
...     print(link['href'], ''.join(link.stripped_strings), link.next_sibling)
... 
http://...html foo  | bar
http://...html foo  | bar
http://...html foo  | bar
http://...html foo  | bar

The output is the same whether or not the <br/> tags were first removed from the tree.
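
If a list of separate trees really is required, as the question asks, one possible approach is to walk the children of the <p> tag and start a new soup at every <br>. This is only a sketch and not part of the answer above: it assumes the bs4 package on Python 3 with "html.parser", and split_on_br is a hypothetical helper name.

```python
import copy

from bs4 import BeautifulSoup, Tag

s = """<p>
foo<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar
</p>"""


def split_on_br(container):
    """Hypothetical helper: one BeautifulSoup object per <br>-separated chunk."""
    pieces = [BeautifulSoup("", "html.parser")]
    for child in container.children:
        if isinstance(child, Tag) and child.name == "br":
            # A <br> closes the current chunk and opens a new, empty soup.
            pieces.append(BeautifulSoup("", "html.parser"))
        else:
            # Append a copy so the original tree is left untouched.
            pieces[-1].append(copy.copy(child))
    return pieces


soup = BeautifulSoup(s, "html.parser")
output = split_on_br(soup.p)

for piece in output:
    if piece.a is not None:
        print(piece.a["href"], piece.a.get_text(), piece.a.next_sibling)
```

Each element of output is an independent tree, so output[1].a and similar lookups work, but the pieces no longer share a parent with each other or with the original soup.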
