python - Processing HTML to remove stray closing tags and close open tags in Python

Question

I am trying to use HTMLParser in Python to process HTML that has missing closing tags or invalid closing tags:

Input:

<div>
  <p>foo 
</div>
bar</span>

Output: (close the open tags and add an opening tag for the unmatched closing tag)

<div>
  <p>foo</p>
</div>
<span>bar</span>

Or even: (drop the closing tag that has no matching opening tag, and close all open tags)

<div>
  <p>foo bar</p>
</div>

My code only closes the open tags; I can't find a way to edit the HTML from inside HTMLParser's parsing loop.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

singleton_tags = [
  'area','base','br','col','command','embed','hr',
  'img', 'input','link','meta','param','source'
]

class HTMLParser_(HTMLParser):

    def __init__(self, *args, **kwargs):
        HTMLParser.__init__(self, *args, **kwargs)
        self.open_tags = []

    # Handle opening tag
    def handle_starttag(self, tag, attrs):
        if tag not in singleton_tags:
            self.open_tags.append(tag)

    # Handle closing tag
    def handle_endtag(self, tag):
        # Guard against a stray closing tag arriving when nothing is open
        if tag not in singleton_tags and self.open_tags:
            self.open_tags.pop()

def close_tags(text):
    parser = HTMLParser_()

    # Mounts stack of open tags
    parser.feed(text)

    # Closes open tags, innermost first
    text += ''.join('</%s>' % tag for tag in reversed(parser.open_tags))

    return text

score 2 · Accepted Answer

I'd suggest looking into BeautifulSoup. It's the best HTML parser I have used (in any language), and it makes working with HTML in Python very easy.
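
As a rough illustration of this suggestion, the sketch below parses the question's input and serializes it back. It assumes the `beautifulsoup4` package (imported as `bs4`) and its `"html.parser"` backend; the variable names are only for illustration. Other backends such as lxml or html5lib may repair the same input differently, in particular in how they treat the stray `</span>`.

```python
# Requires the third-party package: pip install beautifulsoup4
from bs4 import BeautifulSoup

# The broken input from the question.
broken = "<div>\n  <p>foo \n</div>\nbar</span>"

# "html.parser" is the standard-library backend (an assumption made here;
# lxml or html5lib can produce a different tree for the same input).
soup = BeautifulSoup(broken, "html.parser")

# Serializing the parsed tree emits a closing tag for every element
# that was left open in the input.
repaired = str(soup)
print(repaired)
```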

There is also a prettify function that may be useful to you. See the section of the documentation titled "Printing a Document".
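
If adding a dependency is not an option, the second output from the question can be approximated with the standard library alone. The sketch below is not part of the original answer: it assumes Python 3, and `TagRepairer`, `repair_tags` and `VOID_TAGS` are hypothetical names. Instead of editing the input during parsing, it rebuilds the markup inside the handlers, drops closing tags that have no opener, and closes anything still open.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag (list copied from the question).
VOID_TAGS = {
    'area', 'base', 'br', 'col', 'command', 'embed', 'hr',
    'img', 'input', 'link', 'meta', 'param', 'source',
}


class TagRepairer(HTMLParser):
    """Hypothetical helper: rebuilds the markup while tracking open tags."""

    def __init__(self):
        # convert_charrefs=False passes entities such as &amp; through as-is.
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open, non-void tags
        self.out = []        # pieces of the rebuilt document

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written in the input.
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag with no matching opener: drop it
        # Close every tag opened after the match, then the match itself.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append('</%s>' % current)
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def repair_tags(text):
    parser = TagRepairer()
    parser.feed(text)
    parser.close()
    # Close whatever is still open, innermost first.
    for tag in reversed(parser.open_tags):
        parser.out.append('</%s>' % tag)
    return ''.join(parser.out)


print(repair_tags("<div>\n  <p>foo \n</div>\nbar</span>"))
```

On the question's input this closes `<p>` before `</div>` and drops the stray `</span>`; the text `bar` stays outside the `<div>`, so moving it into the paragraph would need extra logic.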
