python - How do I iterate over the words of a sentence string in Python?

Question

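Suppose I have a string text = "A compiler translates code from a source language". I want to do two things:

1. I need to iterate over each word and stem it using the NLTK library. The stemming function is PorterStemmer().stem_word(word), and we have to pass it the argument 'word'. How can I stem each word and get the stemmed sentence back?
2. I need to remove certain stop words from the text string. The list of stop words is stored in a text file (space separated):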
```
stopwordsfile = open('c:/stopwordlist.txt','r+')
stopwordslist=stopwordsfile.read()
```
How do I remove these stop words from text and get a clean new string?

score 9 · Accepted Answer

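I posted this as a comment, but I thought I might as well flesh it out into a full answer with some explanation:

You want to use str.split() to split the string into words, and then stem each word: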

for word in text.split(" "):
    PorterStemmer().stem_word(word)

Since you want all the stems back in one string, joining them up again is trivial. To do this easily and efficiently, we use str.join() and a generator expression:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))

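The split, stem, and join steps above can be sketched as a small runnable function. This is only an illustration: `stem_sentence` and `toy_stem` are hypothetical names that do not appear in the answer, and `toy_stem` is a deliberately crude stand-in (it just drops one trailing "s") so the example runs without NLTK installed.

```python
def stem_sentence(text, stem):
    """Split on whitespace, stem each word, and rejoin with single spaces."""
    return " ".join(stem(word) for word in text.split())


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: drops one trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


text = "A compiler translates code from a source language"
stemmed = stem_sentence(text, toy_stem)
print(stemmed)
```

With NLTK available, a real stemmer method could be passed in place of `toy_stem`. As far as I know, recent NLTK releases expose it as `PorterStemmer().stem`, while the `stem_word` name used in this answer comes from older releases. Creating the stemmer once and passing its method in also avoids building a new PorterStemmer for every word.
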
Edit:

For your other problem:

with open("/path/to/file.txt") as f:
    # Strip the trailing newline from each line so entries match plain words.
    words = set(line.strip() for line in f)

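Here we open the file using the with statement (this is the best way to open files, since it closes them correctly even when an exception is raised, and it is more readable) and read the contents into a set. We use a set because we don't care about the order of the words or about duplicates, and it will make the later lookups more efficient. I am assuming one word per line; if that is not the case and the words are comma separated or whitespace separated, then using str.split() as we did before (with appropriate arguments) is probably a good plan.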
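
The question states that the stop-word file is space separated, so the str.split() variant mentioned above applies. A minimal sketch of that case, assuming the whole file fits in memory; `load_stopwords`, the temporary file, and its contents are hypothetical and only there to make the example self-contained:

```python
import os
import tempfile


def load_stopwords(path):
    """Read a file and return its whitespace-separated words as a set."""
    with open(path) as f:
        # split() with no argument splits on spaces, tabs, and newlines alike.
        return set(f.read().split())


# Demo with a throwaway file; the contents are made up for illustration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a from the\nof a\n")
    demo_path = tmp.name

stopwords = load_stopwords(demo_path)
os.remove(demo_path)
print(sorted(stopwords))
```

Calling split() without an argument also copes with repeated spaces and a trailing newline, which split(" ") would turn into empty strings or leave attached to the last word.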

stems = (PorterStemmer().stem_word(word) for word in text.split(" "))
" ".join(stem for stem in stems if stem not in words)

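Here we use the if clause of the generator expression to skip any stem that is in the set of words we loaded from the file. A membership check on a set is O(1) on average, so this should be relatively efficient.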
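
One detail worth keeping in mind: set membership is an exact, case-sensitive string comparison, so the capitalised "A" at the start of the question's text does not match a stop word stored as "a". A minimal sketch, assuming the stop words are stored in lowercase and that lowercasing each word for the check is acceptable; `remove_stopwords` and the sample stop-word set are hypothetical:

```python
def remove_stopwords(text, stopwords):
    """Drop words whose lowercase form is a stop word; keep original casing."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


text = "A compiler translates code from a source language"
stopwords = {"a", "from"}  # illustrative values, not from the question's file

# Exact comparison keeps the leading "A" because "A" != "a".
exact = " ".join(w for w in text.split() if w not in stopwords)
# Case-insensitive comparison removes it as well.
cleaned = remove_stopwords(text, stopwords)

print(exact)
print(cleaned)
```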

Edit 2:

To remove them before stemming, it is even simpler:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" ") if word not in words)

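The two orderings are not interchangeable: filtering after stemming compares stems against the stop-word list, so a listed word whose stem differs from its listed form is not removed. A small sketch of the difference, again using a hypothetical `toy_stem` (it only drops one trailing "s") in place of a real stemmer, with made-up stop words:

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: drops one trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


text = "A compiler translates code from a source language"
stopwords = {"translates", "from"}  # illustrative values only

# Filter first: the original words are compared against the stop-word set.
filter_then_stem = " ".join(
    toy_stem(w) for w in text.split() if w not in stopwords
)

# Stem first: "translates" becomes "translate", which is not in the set.
stems = (toy_stem(w) for w in text.split())
stem_then_filter = " ".join(s for s in stems if s not in stopwords)

print(filter_then_stem)
print(stem_then_filter)
```

If the stop-word file holds ordinary unstemmed words, filtering before stemming is usually the ordering that matches the list as written.
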
Removing the given words is simple:

filtered_words = [word for word in unfiltered_words if word not in set_of_words_to_filter]

score 4 · Accepted Answer

To iterate over each word in the string:

for word in text.split():
    PorterStemmer().stem_word(word)

Use the string's join method (as recommended by Lattyware) to join the pieces into one big string.

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))
