python - How can I generate a table of contents for HTML text in Python?

Question

Suppose I have some HTML code like this (generated from Markdown, Textile, or something similar):

<h1>A header</h1>
<p>Foo</p>
<h2>Another header</h2>
<p>More content</p>
<h2>Different header</h2>
<h1>Another toplevel header</h1>
<!-- and so on -->

How can I generate a table of contents for it using Python?

score 6 · Accepted Answer

Use an HTML parser such as lxml or BeautifulSoup to find all the heading elements.
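As a rough illustration of "find all the heading elements", here is a minimal sketch. It swaps in the standard library's html.parser instead of lxml or BeautifulSoup so that it runs without third-party packages. The names HeadingCollector and collect_headings are hypothetical, and the sample string is modelled on the markup in the question.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collects (level, text) pairs for h1-h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished (level, text) pairs, in document order
        self._level = None   # level of the heading currently being read
        self._parts = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = None


def collect_headings(html):
    """Return every heading in the HTML string as a (level, text) pair."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = """<h1>A header</h1>
<p>Foo</p>
<h2>Another header</h2>
<p>More content</p>
<h2>Different header</h2>
<h1>Another toplevel header</h1>"""

# Print a plain indented outline, two spaces per heading level.
for level, text in collect_headings(sample):
    print("  " * (level - 1) + text)
```

A heading whose closing tag is missing is never recorded by this sketch, so malformed input may need a more forgiving parser such as the ones named in the answer.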

Answered 2010-02-05T20:41:09.480

score 3 · Accepted Answer

Here is an example using lxml and XPath.

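from lxml import etree

# etree.parse uses the XML parser here, so test.xml must be well-formed.
doc = etree.parse("test.xml")
# Visit every h1-h5 element in document order and print its tag and text.
for node in doc.xpath('//h1|//h2|//h3|//h4|//h5'):
    print(node.tag, node.text)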
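Once the (tag, text) pairs are available, turning them into an actual table of contents is mostly a matter of tracking nesting depth. The following is one possible sketch, not part of the original answer. The function name number_headings is hypothetical, and the input list simply mirrors what the loop above would print for the question's markup.

```python
def number_headings(headings):
    """Turn (tag, text) pairs such as ("h2", "Title") into numbered TOC lines."""
    counters = []  # counters[i] is the current section number at depth i + 1
    lines = []
    for tag, text in headings:
        # "h2" -> 2; a skipped level (h1 followed by h3) is treated as one level deeper.
        depth = min(int(tag[1]), len(counters) + 1)
        del counters[depth:]          # moving back up discards deeper counters
        if len(counters) < depth:
            counters.append(0)        # first heading seen at this depth
        counters[depth - 1] += 1
        number = ".".join(str(n) for n in counters)
        # node.text can be None in lxml, so fall back to an empty string.
        lines.append("  " * (depth - 1) + number + " " + (text or "").strip())
    return lines


pairs = [
    ("h1", "A header"),
    ("h2", "Another header"),
    ("h2", "Different header"),
    ("h1", "Another toplevel header"),
]
print("\n".join(number_headings(pairs)))
```

Clamping the depth is a design choice made for this sketch: it keeps the numbering contiguous when a document skips a heading level, at the cost of not reflecting the skipped level in the indentation.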
