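Since this text contains only image tags, a regular expression would probably be fine. For anything else, though, you are better off using a real HTML parser. Fortunately, Python provides one! The version below is very bare-bones; to be fully functional it would have to handle many more corner cases. (Most notably, XHTML-style empty tags, which end with a slash as in <... />, are not handled correctly here.)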
>>> from HTMLParser import HTMLParser
>>>
>>> class TagDropper(HTMLParser):
...     def __init__(self, tags_to_drop, *args, **kwargs):
...         HTMLParser.__init__(self, *args, **kwargs)
...         self._text = []
...         self._tags_to_drop = set(tags_to_drop)
...     def clear_text(self):
...         self._text = []
...     def get_text(self):
...         return ''.join(self._text)
...     def handle_starttag(self, tag, attrs):
...         # Re-emit the original start tag unless it is being dropped.
...         if tag not in self._tags_to_drop:
...             self._text.append(self.get_starttag_text())
...     def handle_endtag(self, tag):
...         # Skip the end tag of a dropped element as well.
...         if tag not in self._tags_to_drop:
...             self._text.append('</{0}>'.format(tag))
...     def handle_data(self, data):
...         self._text.append(data)
...
>>> td = TagDropper([])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print td.get_text()
A line of text
A line of text with an <img url="foo"> tag
Another line of text with a <br> tag
And to drop the img tags...
>>> td = TagDropper(['img'])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print td.get_text()
A line of text
A line of text with an  tag
Another line of text with a <br> tag
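The XHTML-style caveat mentioned at the top could be addressed roughly as follows. This is only a minimal sketch, assuming Python 3 (where the module is html.parser rather than HTMLParser). The drop_tags helper is a hypothetical convenience wrapper, not part of the original answer. It overrides handle_startendtag so that a tag such as <br /> is emitted once instead of being followed by a spurious </br>, and it passes convert_charrefs=False so that entities are written back out unchanged.

```python
from html.parser import HTMLParser


class TagDropper(HTMLParser):
    """Re-emit HTML as text, omitting the tags named in tags_to_drop."""

    def __init__(self, tags_to_drop, **kwargs):
        # Keep entities such as &amp; as-is instead of decoding them;
        # they are routed to handle_entityref / handle_charref below.
        kwargs.setdefault("convert_charrefs", False)
        super().__init__(**kwargs)
        self._text = []
        # The parser reports tag names in lower case.
        self._tags_to_drop = {t.lower() for t in tags_to_drop}

    def get_text(self):
        return "".join(self._text)

    def handle_starttag(self, tag, attrs):
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # XHTML-style empty tag (<br />): emit the source text once,
        # without a separate end tag.
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self._tags_to_drop:
            self._text.append("</{0}>".format(tag))

    def handle_data(self, data):
        self._text.append(data)

    def handle_entityref(self, name):
        self._text.append("&{0};".format(name))

    def handle_charref(self, name):
        self._text.append("&#{0};".format(name))


def drop_tags(html, tags_to_drop):
    """Hypothetical helper: return html with the given tags removed."""
    parser = TagDropper(tags_to_drop)
    parser.feed(html)
    parser.close()
    return parser.get_text()


print(drop_tags('a <br /> b <img src="x" /> c', ["img"]))
# prints: a <br /> b  c
```

Comments, doctype declarations and processing instructions are still silently discarded by this sketch; keeping them would need handle_comment, handle_decl and handle_pi overrides as well.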