python - Remove HTML image tags and everything in between from a string

Question

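I have seen many questions about removing HTML tags from a string, but I am still unclear on how to handle my specific case.

I have seen many posts advising against using regular expressions to process HTML, but I suspect my case may warrant a judicious exception to that rule.

I am trying to parse PDF files, and I have successfully converted each page of my sample PDF file into a string of UTF-32 text. Wherever an image appears, an HTML-style tag is inserted that contains the name and location of the image (which is saved elsewhere).

In a separate part of my application, I need to get rid of these image tags. Because we are dealing only with image tags, I suspect that a regular expression may be warranted.

My question is twofold:

Should I use a regular expression to remove these tags, or should I still use an HTML parsing module such as BeautifulSoup?
Which regular expression or BeautifulSoup construct should I use? In other words, how should I code this?

For clarity, the tags are structured as <img src="/path/to/file"/>

Thanks!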

score 15 · Accepted Answer

I would vote that using a regular expression is acceptable in your case. Something like this should work:

import re

def remove_html_tags(data):
    # Non-greedy match of everything between '<' and the next '>'
    p = re.compile(r'<.*?>')
    return p.sub('', data)

I found that snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)

Edit: a version that only removes tags of the form <img .... />:

def remove_img_tags(data):
    p = re.compile(r'<img.*?/>')
    return p.sub('', data)
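
As a rough sketch of how the pattern above could be loosened, the variant below also matches <img ...> tags written without the trailing slash, ignores case, and lets a tag span several lines. The name remove_img_tags_loose and the sample string are made up for illustration, and it still assumes that no '>' appears inside an attribute value.

```python
import re

# Hypothetical variant for illustration; not from the original answer.
# '<img', a word boundary (so '<imgx>' is left alone), then anything up to
# the first '>'. The negated class [^>] also matches newlines.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

def remove_img_tags_loose(data):
    return IMG_TAG.sub('', data)

sample = 'before <img src="/path/to/file"/> middle <IMG\nsrc="x"> after'
cleaned = remove_img_tags_loose(sample)
print(cleaned)
```

Compiling the pattern once at module level avoids recompiling it on every call, which may matter if the function runs once per PDF page.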

score 3 · Accepted Answer

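Since this text only contains image tags, using a regular expression is probably fine. For anything else, though, you are better off using a real HTML parser. Fortunately, Python provides one! The example below is very bare-bones; to be fully functional, it would have to handle many more corner cases. (Most notably, XHTML-style empty tags (ending with a slash, <... />) are not handled correctly here.)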

>>> from html.parser import HTMLParser
>>> 
>>> class TagDropper(HTMLParser):
...     def __init__(self, tags_to_drop, *args, **kwargs):
...         super().__init__(*args, **kwargs)
...         self._text = []
...         self._tags_to_drop = set(tags_to_drop)
...     def clear_text(self):
...         self._text = []
...     def get_text(self):
...         return ''.join(self._text)
...     def handle_starttag(self, tag, attrs):
...         if tag not in self._tags_to_drop:
...             self._text.append(self.get_starttag_text())
...     def handle_endtag(self, tag):
...         if tag not in self._tags_to_drop:
...             self._text.append('</{0}>'.format(tag))
...     def handle_data(self, data):
...         self._text.append(data)
... 
>>> td = TagDropper([])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print(td.get_text())
A line of text
A line of text with an <img url="foo"> tag
Another line of text with a <br> tag

And to remove the img tags...

>>> td = TagDropper(['img'])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print(td.get_text())
A line of text
A line of text with an  tag
Another line of text with a <br> tag
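
The XHTML-style empty tags mentioned above could be handled by overriding handle_startendtag, which HTMLParser calls for tags written as <... />. The following is a minimal Python 3 sketch under that assumption. The class name SelfClosingTagDropper and the sample input are made up here, and comments and doctype declarations are still left unhandled.

```python
from html.parser import HTMLParser

class SelfClosingTagDropper(HTMLParser):
    """Hypothetical sketch: re-emits the input, minus the listed tags."""

    def __init__(self, tags_to_drop):
        # convert_charrefs=False keeps entities such as &amp; as written,
        # routing them to handle_entityref / handle_charref below.
        super().__init__(convert_charrefs=False)
        self._text = []
        self._tags_to_drop = set(tags_to_drop)

    def get_text(self):
        return ''.join(self._text)

    def handle_starttag(self, tag, attrs):
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Called for <... /> tags: emit the original text once, no end tag.
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self._tags_to_drop:
            self._text.append('</{0}>'.format(tag))

    def handle_data(self, data):
        self._text.append(data)

    def handle_entityref(self, name):
        self._text.append('&{0};'.format(name))

    def handle_charref(self, name):
        self._text.append('&#{0};'.format(name))

td = SelfClosingTagDropper(['img'])
td.feed('Text with an <img src="/path/to/file"/> tag, a <br/> tag &amp; more\n')
td.close()  # flush anything the parser is still buffering
result = td.get_text()
print(result)
```

Without the handle_startendtag override, the default implementation calls handle_starttag followed by handle_endtag, so a kept tag such as <br/> would come out with a spurious </br> after it.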

score 0 · Accepted Answer

My solution is:

import re

def remove_HTML_tag(tag, string):
    # Remove the opening tags first, then the matching closing tags
    string = re.sub(r"<\b(" + tag + r")\b[^>]*>", r"", string)
    return re.sub(r"<\/\b(" + tag + r")\b[^>]*>", r"", string)
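
To illustrate what this approach does, here is a small usage sketch. The function name remove_html_tag, the sample string, and the use of re.escape are additions for this sketch rather than part of the answer above; re.escape is there only so that a tag name containing regex metacharacters is matched literally. Note that only the tags themselves are removed, and any text between an opening and a closing tag is kept.

```python
import re

def remove_html_tag(tag, string):
    # re.escape is an addition for this sketch, not part of the answer above.
    name = re.escape(tag)
    # Opening (or self-closing) tags, e.g. <img ...> or <img .../>
    string = re.sub(r"<" + name + r"\b[^>]*>", "", string)
    # Closing tags, e.g. </b>
    return re.sub(r"</" + name + r"\b[^>]*>", "", string)

sample = 'An <img src="/path/to/file"/> here and <b>bold</b> text'
no_img = remove_html_tag("img", sample)
no_b = remove_html_tag("b", sample)
print(no_img)
print(no_b)
```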
