python - Wrap all next_elements in BeautifulSoup

Question

I have some HTML like this:

<figure>
    <img src=".." alt=".." />
    Some text that I have to wrap in <code>figcaption</code>
</figure>

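I'm trying to wrap everything after the <img> in a <figcaption>. Is that possible?

next_elements does a good job of getting the elements I want, but it returns a generator, which doesn't play well with the wrap method.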

score 2 · Accepted Answer

Here's one way to do it:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <figure>
...     <img src=".." alt=".." />
...     Some text that I have to wrap in <code>figcaption</code>
... </figure>
... """)
>>> for figure in soup.find_all("figure"):
...     img = figure.find("img")
...     if img is not None:
...         figcaption = soup.new_tag("figcaption")
...         for el in list(img.next_siblings):
...             figcaption.append(el)
...         img.insert_after(figcaption)
... 
>>> soup
<html><body><figure>
    <img alt=".." src=".."/><figcaption>
    Some text that I have to wrap in <code>figcaption</code>
</figcaption></figure></body></html>

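A few things to note:

We use next_siblings, which returns only the elements we actually need, rather than next_elements, which would continue past the end of the figure element.
We wrap next_siblings in list() to create a shallow copy that we can iterate over. Otherwise, because appending el to figcaption removes it from its previous place in the document tree, we would be modifying the sequence while iterating over it, which is a bad idea. We could have used find_next_siblings() (which also returns a list), but the version above is more explicit.
Since we have already removed all the next siblings of img from their previous place in the document tree, all we need to do is insert figcaption (which now contains them) immediately after the img element.
The placement of whitespace is no longer intuitively "right" to a human reader, but fixing that would take a lot of extra work and probably isn't worth it.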
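
As a rough illustration of the first three points, the same steps can be packaged as a small function. This is only a sketch: the name wrap_after_img and the sample markup are made up for the example, and "html.parser" is passed explicitly (an assumption, the session above does not name a parser), so the result has no <html><body> wrapper.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input: a <figure> followed by an unrelated <p>.
HTML = (
    '<figure><img src=".." alt=".."/>Some text in <code>figcaption</code></figure>'
    "<p>after</p>"
)


def wrap_after_img(html: str) -> str:
    """Move everything after each figure's <img> into a new <figcaption>."""
    # Parser choice is an assumption made for this sketch.
    soup = BeautifulSoup(html, "html.parser")
    for figure in soup.find_all("figure"):
        img = figure.find("img")
        if img is None:
            continue
        figcaption = soup.new_tag("figcaption")
        # list() takes a snapshot first, because append() removes each
        # element from its old position in the tree.
        for el in list(img.next_siblings):
            figcaption.append(el)
        img.insert_after(figcaption)
    return str(soup)


# Compare the two generators on an untouched copy of the markup.
# Text nodes are reported as None, tags by their tag name.
probe = BeautifulSoup(HTML, "html.parser").find("img")
sibling_names = [getattr(el, "name", None) for el in probe.next_siblings]
element_names = [getattr(el, "name", None) for el in probe.next_elements]

result = wrap_after_img(HTML)
print(sibling_names)
print(element_names)
print(result)
```

Here element_names includes the p tag that follows the figure, while sibling_names stops at the last child of the figure, which is why next_siblings is the safer choice for this task.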
