python - Is there something like NLTK for Python that doesn't require installation?

Question

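I am using NLTK to strip the tags from HTML files and keep only the text.

NLTK installs in a few seconds on my Linux machine, but it is a pain to use on Windows, and I know that if I have trouble installing the nltk module myself, my clients in other countries will not be able to install it.

What is a simple alternative that ships with Python and requires no installation? I need this as part of a script.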

score 1 · Accepted Answer

The question is really "How do I remove HTML tags from a string?"

import re
def strip_tags(s):
    return re.sub("<[^>]+>", "", s)
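
As a quick check of what this simple pattern does, here is a minimal sketch assuming Python 3. The `sample` string is made up for illustration. Note that the pattern only deletes things that look like `<...>`: entities such as `&amp;` are left untouched, and the contents of `<script>` or `<style>` elements stay in the output.

```python
import re

def strip_tags(s):
    # Delete every run that starts with "<" and ends at the next ">"
    return re.sub(r"<[^>]+>", "", s)

# Hypothetical input, for illustration only
sample = "<p>Hello, <b>world</b>!</p>"
print(strip_tags(sample))  # Hello, world!
```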

Also, for future reference, all you need is Christoph Gohlke's Python Extensions for Windows page.

Edit: fixed the regex.

Double edit: inspired by the comments, here is an abomination.

def strip_tags(s):
    return re.sub(r"""</?\w+(\s*([^=]+=(?P<q>['"]).+?(?P=q))|\s*\w+(=\w+)?)*>""", "", s)

score 0 · Accepted Answer

You could try:

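import xml.etree.ElementTree as ET

# ET.parse() reads a file; the input must be well-formed XML (e.g. XHTML)
root = ET.parse('whatever').getroot()
# Keep the non-empty leading text of every descendant element
text = list(filter(None, ((el.text or '').strip() for el in root.findall('.//*'))))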

What you then do with text is up to you.
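
ElementTree raises a ParseError on markup that is not well-formed XML, which is common in real-world HTML. A more forgiving option that also ships with Python is the standard library's `html.parser` module. The following is a minimal sketch assuming Python 3; the class name `TextExtractor` and the function `html_to_text` are hypothetical helpers invented for this example, not part of any library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text and skips <script>/<style> bodies."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if not self._skip_depth:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.get_text()
```

Calling `html_to_text("<p>Tom &amp; Jerry</p>")` returns the text with the tags removed and the entity decoded. Whitespace is kept as it appears in the source, so you may want to normalize it afterwards.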
