python - How to split one HTML page into multiple pages using Python and Beautiful Soup

Question

I have a simple HTML file like the one below. I actually extracted it from a wiki page, removed some HTML attributes, and reduced it to this simple HTML page.

<html>
   <body>
      <h1>draw electronics schematics</h1>
      <h2>first header</h2>
      <p>
         <!-- ..some text images -->
      </p>
      <h3>some header</h3>
      <p>
         <!-- ..some image -->
      </p>
      <p>
         <!-- ..some text -->
      </p>
      <h2>second header</h2>
      <p>
         <!-- ..again some text and images -->
      </p>
   </body>
</html>

I read this HTML file with Python and Beautiful Soup like this:

from bs4 import BeautifulSoup

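# Name the parser explicitly and close the file once it has been parsed
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")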

pages = []

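What I want to do is split this HTML page into two parts. The first part runs from the first <h2> header up to the second <h2> header. The second part runs from the second <h2> header to the </body> tag. I then want to store them in a list, e.g. pages, so that I can create multiple pages from one HTML page based on the <h2> tags.

Any ideas on how I should do this? Thanks.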

score 5 · Accepted Answer

Find the h2 tags, then use .next_sibling to grab everything that follows until you reach another h2 tag:

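from bs4 import BeautifulSoup, Tag

with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")
pages = []
h2tags = soup.find_all('h2')

def next_element(elem):
    # Return the next sibling that is a Tag, or None when none are left.
    # Text nodes and comments (NavigableString objects) are skipped.
    while elem is not None:
        elem = elem.next_sibling
        if isinstance(elem, Tag):
            return elem
    return None

for h2tag in h2tags:
    # Each page starts with its own <h2> and runs up to the next <h2>
    page = [str(h2tag)]
    elem = next_element(h2tag)
    while elem is not None and elem.name != 'h2':
        page.append(str(elem))
        elem = next_element(elem)
    pages.append('\n'.join(page))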

With your example, this gives:

>>> pages
['<h2>first header</h2>\n<p>\n<!-- ..some text images -->\n</p>\n<h3>some header</h3>\n<p>\n<!-- ..some image -->\n</p>\n<p>\n<!-- ..some text -->\n</p>', '<h2>second header</h2>\n<p>\n<!-- ..again some text and images -->\n</p>']
>>> print(pages[0])
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
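
The question's stated goal is to create multiple pages from these fragments. A minimal sketch of one way to do that, using only the standard library: wrap each entry of pages in an <html>/<body> skeleton and write it to its own file. The names PAGE_TEMPLATE, wrap_fragment and write_pages, and the page_N.html file naming, are assumptions made for illustration and are not part of the original answer.

```python
import tempfile
from pathlib import Path

# Hypothetical skeleton; the answer's loop only produces bare fragments.
PAGE_TEMPLATE = """<html>
   <body>
{fragment}
   </body>
</html>
"""


def wrap_fragment(fragment):
    """Wrap one extracted fragment in a minimal <html>/<body> skeleton."""
    return PAGE_TEMPLATE.format(fragment=fragment)


def write_pages(pages, out_dir):
    """Write each fragment to page_1.html, page_2.html, ... and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, fragment in enumerate(pages, start=1):
        path = out_dir / f"page_{number}.html"
        path.write_text(wrap_fragment(fragment), encoding="utf-8")
        paths.append(path)
    return paths


# Stand-in for the `pages` list built by the loop in the answer.
pages = [
    "<h2>first header</h2>\n<p>some text</p>",
    "<h2>second header</h2>\n<p>more text</p>",
]
written = write_pages(pages, tempfile.mkdtemp())
print([p.name for p in written])
```

In practice you would pass the real pages list and a directory of your choice instead of the temporary one used here.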
