python - Completely removing bad tags with html5lib.sanitizer

Question

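I'm trying to use html5lib.sanitizer to clean user input, as suggested in the documentation.

The problem is that I want to remove bad tags completely, not just escape them (which seems like a bad idea anyway).

The workaround suggested in the patch here doesn't work as expected (it keeps the inner content of a <tag>content</tag>).

Specifically, I want to do something like this:

Input: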

<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world</h1>
Lorem ipsum

Output:

<h1>Hello world</h1>
Lorem ipsum

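Any ideas on how to achieve this? I've tried BeautifulSoup, but it doesn't seem to work well, and lxml inserts <p></p> tags in very strange places (for example, around src attrs). So far html5lib seems to be the best option, if only I can get it to remove tags instead of escaping them.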

score 1 · Accepted Answer

The challenge is also to strip unwanted nested tags. It isn't pretty, but it's a step in the right direction:

from lxml.html import fromstring
from lxml import etree

html = '''
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world<script>bad_thing();</script></h1>
Lorem ipsum
<script>bad_thing();</script>
<b>Bold Text</b>
'''

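parts = []
doc = fromstring(html)
# Keep only the whitelisted elements. Rebuilding each one from scratch drops
# nested children (such as the <script> inside <h1>) and all attributes.
for el in doc.xpath(".//h1|.//b"):
    i = etree.Element(el.tag)
    i.text, i.tail = el.text, el.tail
    # encoding="unicode" makes tostring() return str instead of bytes on Python 3
    parts.append(etree.tostring(i, encoding="unicode"))

print(''.join(parts))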

Which outputs:

<h1>Hello world</h1>
Lorem ipsum
<b>Bold Text</b>
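
The whitelist approach above rebuilds each element, so it also discards attributes and any allowed tags nested inside the kept ones. As a rough alternative sketch (not the html5lib or lxml method, but a different technique built on the standard library's html.parser), the following re-emits the markup while skipping a blacklisted element together with everything inside it. The names KILL_TAGS, TagKiller and strip_killed_tags are hypothetical, chosen only for this illustration, and the code assumes reasonably well-formed input. It does not filter attributes such as onclick, so on its own it should not be treated as a complete sanitizer.

```python
from html.parser import HTMLParser

# Hypothetical blacklist: elements removed together with their contents.
KILL_TAGS = {"script", "style"}


class TagKiller(HTMLParser):
    """Re-emit HTML, skipping KILL_TAGS elements and everything inside them."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written,
        # so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a killed element

    def handle_starttag(self, tag, attrs):
        if tag in KILL_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Reuse the original start tag text, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if tag not in KILL_TAGS and self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in KILL_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append("&#%s;" % name)


def strip_killed_tags(html):
    """Return html with KILL_TAGS elements and their contents removed."""
    parser = TagKiller()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = (
        "<script>bad_thing();</script>\n"
        "<h1>Hello world<script>bad_thing();</script></h1>\n"
        "Lorem ipsum\n"
    )
    print(strip_killed_tags(sample))
```

Because the parser handles the contents of script and style as raw text, their bodies arrive in handle_data and are dropped while skip_depth is non-zero. Comments and doctype declarations are also dropped, since their handlers are left at the default no-op.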
