jquery - How do I parse an HTML string into HTML DOM elements in Python?

Question

I have a string of HTML elements:

HTMLstr = """
    <div class='column span4 ui-sortable' id='column1'></div>
    <div class='column span4 ui-sortable' id='column2'>
        <div class='portlet ui-widget ui-widget-content ui-helper-clearfix ui-corner-all' id='widget_basicLine'>
        <div class='portlet-header ui-widget-header ui-corner-all'><span class='ui-icon ui-icon-minusthick'></span>Line Chart </div>
        <div class='portlet-content' id=basicLine style='height:270px; margin: 0 auto;'></div>          
        </div>
    </div>
    <div class='column span4 ui-sortable' id='column3'></div> """

I want to convert the HTML string above into the corresponding HTML DOM elements in Python.

In a jQuery/AJAX function I can do this with $(this).html(HTMLstr); but how do I parse it in Python?

score 6 · Accepted Answer

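Python has built-in libraries for parsing HTML documents. In Python 2.x you can choose between HTMLParser (recommended) and htmllib (deprecated); in Python 3.x, html.parser is the appropriate library (it is the renamed version of the Python 2.x HTMLParser module).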
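As a rough sketch of what using html.parser looks like in Python 3: you subclass HTMLParser and override handler methods that are called as tags are encountered. The TagCollector class and the shortened html_str below are illustrative names and data, not part of the question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records (tag, id) for every start tag that has an id."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attr_map = dict(attrs)
        if "id" in attr_map:
            self.ids.append((tag, attr_map["id"]))


# Shortened version of the markup from the question (note the unquoted id)
html_str = """
<div class='column span4 ui-sortable' id='column1'></div>
<div class='column span4 ui-sortable' id='column2'>
    <div class='portlet-content' id=basicLine style='height:270px; margin: 0 auto;'></div>
</div>
"""

collector = TagCollector()
collector.feed(html_str)
collector.close()
print(collector.ids)
```

The parser only reports events; it does not build a tree, so any structure you need has to be tracked in your own handler code.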

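However, these are event-driven parsers (similar to XML SAX parsers), which may not be what you want. If you know the document will be valid XML (i.e. properly closed tags and so on), an alternative is to use one of Python's XML parsing tools. The xml.dom and xml.dom.minidom libraries are both options, depending on the kind of parsing you are looking for (given your example, I suspect xml.dom.minidom is sufficient for your purposes).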

For example, you should be able to type this into a Python console and get the output shown:

>>> import xml.dom.minidom
>>> x = xml.dom.minidom.parseString('<div class="column span4 ui-sortable" id="column2"><div class="portlet ui-widget ui-widget-content ui-helper-clearfix ui-corner-all" id="widget_basicLine" /></div>')
>>> x.documentElement.nodeName
'div'
>>> x.documentElement.getAttribute("class")
'column span4 ui-sortable'
>>> len(x.documentElement.firstChild.childNodes)
0

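A full description of the node objects you get back is given in the xml.dom documentation. If you are used to working with the DOM in JavaScript, you should find that most of the properties are the same. Note that because Python treats this as an XML document, HTML-specific attributes such as "class" have no special meaning, so I believe you have to use the getAttribute function to access them.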
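A minimal sketch of how this might apply to a string like the one in the question, assuming Python 3. Two things are assumptions here rather than part of the original answer: the fragment is shortened and its id attribute is quoted (an XML parser rejects unquoted attribute values), and the <root> wrapper is a hypothetical element added only because an XML document needs a single top-level element.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Shortened, XML-clean version of the markup (all attribute values quoted)
fragment = (
    "<div class='column span4 ui-sortable' id='column1'></div>"
    "<div class='column span4 ui-sortable' id='column2'>"
    "<div class='portlet-content' id='basicLine'></div>"
    "</div>"
)

# Two top-level elements are not a well-formed XML document, so this fails.
try:
    xml.dom.minidom.parseString(fragment)
    parsed_without_root = True
except ExpatError:
    parsed_without_root = False

# Wrapping the fragment in one (hypothetical) <root> element makes it parseable.
doc = xml.dom.minidom.parseString("<root>" + fragment + "</root>")

# getElementsByTagName returns matching elements in document order.
divs = doc.getElementsByTagName("div")
ids = [div.getAttribute("id") for div in divs]

# "class" is just a string attribute here; split it yourself if you need a list.
classes = divs[0].getAttribute("class").split()

print(parsed_without_root, ids, classes)
```

If the markup cannot be cleaned up into well-formed XML, this approach will keep raising parse errors, and an HTML-aware parser is the safer choice.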

score 2 · Accepted Answer

You should use BeautifulSoup - it does exactly what you need.

http://www.crummy.com/software/BeautifulSoup/
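A short sketch of what that could look like, assuming the current Beautiful Soup 4 package (installed as beautifulsoup4, imported as bs4) together with Python's built-in "html.parser" backend. The html_str below is a shortened copy of the markup from the question.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html_str = """
<div class='column span4 ui-sortable' id='column1'></div>
<div class='column span4 ui-sortable' id='column2'>
    <div class='portlet-content' id=basicLine style='height:270px; margin: 0 auto;'></div>
</div>
<div class='column span4 ui-sortable' id='column3'></div>
"""

# Build a navigable tree; HTML quirks such as the unquoted id are tolerated.
soup = BeautifulSoup(html_str, "html.parser")

# Look up an element by id; "class" is multi-valued, so it comes back as a list.
column2 = soup.find(id="column2")
column2_classes = column2["class"]

# Search by CSS class (class_ avoids clashing with the Python keyword).
content = soup.find("div", class_="portlet-content")
content_id = content["id"]

# Collect the ids of all divs, in document order.
all_ids = [div["id"] for div in soup.find_all("div")]

print(column2_classes, content_id, all_ids)
```

Unlike xml.dom.minidom, this does not require the input to be well-formed XML or to have a single root element.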
