python - Removing the contents of a tag with html5lib or bleach

Question

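I've been using the excellent bleach library to remove bad HTML.

I have a large number of HTML documents that were pasted in from Microsoft Word, and they contain things like: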

<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

Running this through bleach (with the style tag implicitly disallowed) leaves me with:

st1:*{behavior:url(#ieooui) }

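Which isn't helpful. Bleach seems to offer only two options:

Escape the tags;
Remove the tags (but not their contents).

I'm looking for a third option: remove the tags along with their contents.

Is there any way to use bleach or html5lib to completely remove the style tag and its contents? The html5lib documentation isn't much help.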

score 7 · Accepted Answer

It turned out that lxml was a better tool for this task:

from lxml.html.clean import Cleaner

def clean_word_text(text):
    # The only thing I need Cleaner for is to clear out the contents of
    # <style>...</style> tags
    cleaner = Cleaner(style=True)
    return cleaner.clean_html(text)
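
Where adding lxml is not an option, the same "remove the tag and its contents" behaviour can be approximated with the standard library. The following is only a sketch and is not part of the original answer: it swaps in `html.parser` in place of lxml, `StyleStripper` and `strip_style` are hypothetical names, and it is not a sanitizer. It drops `<style>` elements only, and it also silently drops comments and declarations.

```python
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit parsed HTML, omitting <style> elements and their contents."""

    def __init__(self):
        # Keep character references as written so they are re-emitted verbatim
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
        elif not self.in_style:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged
        if tag != "style" and not self.in_style:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif not self.in_style:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_style:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_style(html):
    """Return html with every <style>...</style> block removed."""
    parser = StyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


word_html = "<STYLE> st1:*{behavior:url(#ieooui) } </STYLE><p>Hello &amp; bye<br/></p>"
result = strip_style(word_html)
```

Because the parser lowercases tag names, the upper-case `<STYLE>` that Word produces is matched as well. For untrusted input, the output would still need to go through a real sanitizer such as bleach afterwards.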

score 1 · Answer

I was able to strip the contents of the tag using a filter based on this approach: https://bleach.readthedocs.io/en/latest/clean.html?highlight=strip#html5lib-filters-filters. It does leave an empty <style></style> in the output, but that is harmless.

from bleach.sanitizer import Cleaner
from bleach.html5lib_shim import Filter

class StyleTagFilter(Filter):
    """
    https://bleach.readthedocs.io/en/latest/clean.html?highlight=strip#html5lib-filters-filters
    """

    def __iter__(self):
        in_style_tag = False
        for token in Filter.__iter__(self):
            if token["type"] == "StartTag" and token["name"] == "style":
                in_style_tag = True
            elif token["type"] == "EndTag":
                in_style_tag = False
            elif in_style_tag:
                # If we are in a style tag, strip the contents
                token["data"] = ""
            yield token


# You must include "style" in the tags list
cleaner = Cleaner(tags=["div", "style"], strip=True, filters=[StyleTagFilter])
cleaned = cleaner.clean("<div><style>.some_style { font-weight: bold; }</style>Some text</div>")

assert cleaned == "<div><style></style>Some text</div>"
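
If the leftover empty <style></style> pair is unwanted, one possible variation is to skip the style tokens entirely instead of blanking their data. The sketch below shows only the token-level logic as a plain generator, so it runs without bleach installed. `drop_style_elements` is a hypothetical name, and the hand-written token dicts are an assumption about the token shape, modelled on the keys the filter above already reads (`type`, `name`, `data`).

```python
def drop_style_elements(tokens):
    """Yield tokens, skipping <style> start/end tags and everything between them."""
    in_style = False
    for token in tokens:
        is_style = token.get("name") == "style"
        if token["type"] == "StartTag" and is_style:
            in_style = True
            continue
        if token["type"] == "EndTag" and is_style:
            in_style = False
            continue
        if in_style:
            # Anything inside the style element is dropped
            continue
        yield token


# Hand-written tokens standing in for what Filter.__iter__ would produce (assumed shape)
tokens = [
    {"type": "StartTag", "name": "div", "data": {}},
    {"type": "StartTag", "name": "style", "data": {}},
    {"type": "Characters", "data": ".some_style { font-weight: bold; }"},
    {"type": "EndTag", "name": "style"},
    {"type": "Characters", "data": "Some text"},
    {"type": "EndTag", "name": "div"},
]
kept = list(drop_style_elements(tokens))
```

Inside a Filter subclass like the one above, `__iter__` could presumably return `drop_style_elements(Filter.__iter__(self))`. That wiring has not been checked against a specific bleach version, and "style" would still need to be in the allowed tags so that the style tags are still present when the filter runs.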
