1

如何通过在 Python中使用来获取<body>元素的内容?html5lib

示例输入数据:<html><head></head><body>xxx<b>yyy</b></hr></body></html>

预期输出:xxx<b>yyy</b></hr>

即使 HTML 被破坏(未封闭的标签,...),它也应该可以工作。

4

1 回答 1

5

html5lib allows you to parse your documents using a variety of standard tree formats. You can do this using lxml, as I've done below, or you can follow the instructions in their user documentation to do it either with minidom, ElementTree or BeautifulSoup.

file = open("mydocument.html")
doc = html5lib.parse(file, treebuilder="lxml")
content = doc.findtext("html/body", default=None):

Response to comment

It is possible to acheive this without installing any external libs using their own simpletree.py, but judging by the comment at the start of the file I would guess this is not the recommended way...

# Really crappy basic implementation of a DOM-core like thing

If you still want to do this, however, you can parse the html document like so:

f = open("mydocument.html")
doc = html5lib.parse(f) 

and then find the element you're looking for by doing a breadth-first search of the child nodes in the document. The nodes are kept in an array named childNodes and each node has a name stored in the field name.

于 2011-05-28T11:59:07.473 回答