python - Grouping HTML content by tag order

Question

I have an HTML file that looks something like this:

    <h2>section 1</h2>
    <p>para 1</p>
    <p>para 2</p>
    <p>para 3</p>
    <h2>section 2</h2>
    <p>para 1</p>
    <p>para 2</p>
    <p>para 3</p>
    <h2>section 3</h2>
    <p>para 1</p>
    <p>para 2</p>
    <p>para 3</p>

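I want to scrape it into a Python dictionary such as {'section1':'...', 'section2':'...', 'section3':'...'}. Of course I could keep a current_section variable and use a while loop, but is there a module made for this? I checked BeautifulSoup but did not find a shortcut.

Thanks!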

score 1 · Accepted Answer

As far as I know there is nothing like soup.group_by_header(), but for the input you describe, what you want is fairly simple to implement anyway:

>>> from bs4 import BeautifulSoup     
>>> html = """
...     <h2>section 1</h2>
...     <p>para 1</p>
...     <!-- etc. -->
... """
>>> soup = BeautifulSoup(html, "html.parser")
>>> sections = {}
>>> for header in soup("h2"):
...     paras = []
...     for sibling in header.find_next_siblings(True):
...         if sibling.name == "h2":
...             break
...         paras.append(sibling.string)
...     sections[header.string] = paras
... 
>>> sections
{u'section 1': [u'para 1', u'para 2', u'para 3'],
 u'section 2': [u'para 1', u'para 2', u'para 3'],
 u'section 3': [u'para 1', u'para 2', u'para 3']}
>>>

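Is this approach problematic for some reason, or are you just wondering whether there is some clever BeautifulSoup method that handles this (which is fair, since there are a few of those)?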
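
If you want the exact shape from the question (one string per section instead of a list of paragraphs), the loop above can be wrapped in a small function. This is only a minimal sketch: the name group_by_header is hypothetical (it is not a BeautifulSoup API), it assumes the built-in "html.parser" backend, and it assumes each section's content consists of direct siblings of its h2.

```python
from bs4 import BeautifulSoup

html = """
<h2>section 1</h2>
<p>para 1</p>
<p>para 2</p>
<h2>section 2</h2>
<p>para 3</p>
"""

def group_by_header(markup, header_tag="h2"):
    """Hypothetical helper: map each header's text to the text that follows it."""
    soup = BeautifulSoup(markup, "html.parser")
    sections = {}
    for header in soup.find_all(header_tag):
        texts = []
        # Passing True matches tags only, so whitespace between tags is skipped.
        for sibling in header.find_next_siblings(True):
            if sibling.name == header_tag:
                break  # the next header starts a new section
            texts.append(sibling.get_text(strip=True))
        # Join the paragraphs so each section maps to a single string.
        sections[header.get_text(strip=True)] = "\n".join(texts)
    return sections

sections = group_by_header(html)
print(sections)
```

Note that headers with identical text would overwrite each other in the dictionary, so a list of (header, text) pairs may be a better fit if titles can repeat.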

score 0 · Accepted Answer

I think you want the built-in split method of str. If you have the text in html_string, you can do:

sections = html_string.split('<h2>')  # split() drops the '<h2>' delimiter
for section in sections:
    if not section.strip():
        continue  # skip the empty chunk before the first <h2>
    section = '<h2>' + section  # restore the opening h2 tag
    # code to parse each section goes here

This should be cleaner than using a while loop.
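
To make the idea concrete, here is a rough end-to-end sketch of the split approach using only the standard library. It assumes markup as regular as the sample in the question: a bare `<h2>` with no attributes, and paragraphs written exactly as `<p>...</p>`. The sample html_string and the regular expression are illustrative assumptions, and anything less regular is better handled by a real HTML parser.

```python
import re

html_string = """
<h2>section 1</h2>
<p>para 1</p>
<p>para 2</p>
<h2>section 2</h2>
<p>para 3</p>
"""

sections = {}
for chunk in html_string.split('<h2>'):
    if not chunk.strip():
        continue  # whatever precedes the first <h2> is not a section
    # Everything before '</h2>' is the title, the rest is the section body.
    title, _, body = chunk.partition('</h2>')
    # Naive paragraph extraction; only suitable for simple, regular markup.
    paras = re.findall(r'<p>(.*?)</p>', body, flags=re.DOTALL)
    sections[title.strip()] = paras

print(sections)
```

Because the chunk is partitioned on the closing tag, there is no need to put the opening `<h2>` back before extracting the title.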
