python - fdist and the top 10 function words

Question

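I have to write a script that gives me all content words in descending order of frequency. I need the 10 most frequent content words, so I not only need to list the 10 most frequent words in my corpus, I also need to filter out any function words (and, or, any punctuation...). What I have so far is the following: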

fileids = corpus.fileids()
text = corpus.words(fileids)
wlist = []
ftable = nltk.FreqDist(text)
wlist.append(ftable.keys())

This gives me a very neat list of all the words in descending order of frequency, but how do I filter out the function words?

Thank you.

score 1 · Accepted Answer

You want to filter out a set of words (stopwords). The core idea is taken from this SO answer:

You need to introduce a few lines into your code:

fileids = corpus.fileids()
text = corpus.words(fileids)

Add the following lines to create a list of stopwords and filter them out of the text:

# get the list of English stopwords
stp = nltk.corpus.stopwords.words('english')

# from your text of words, keep only the ones NOT in stp
filtered_text = [w for w in text if w not in stp]

Now continue as you did before:

wlist = []
ftable = nltk.FreqDist(filtered_text)
wlist.append(ftable.keys())
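
Two caveats, hedged: the stopword list is all lowercase and contains no punctuation, so tokens like "The" or "," would survive the filter above. Also, as far as I know, in NLTK 3.x FreqDist is a subclass of collections.Counter, so keys() is no longer guaranteed to come back in frequency order, while most_common(n) is. Below is a minimal sketch of the whole pipeline under those assumptions. The helper name top_content_words and the sample data are made up for illustration, and it uses Counter directly so it runs without any corpus download:

```python
from collections import Counter

def top_content_words(words, stopwords, n=10):
    """Return the n most frequent tokens that are alphabetic and not stopwords."""
    stop = {s.lower() for s in stopwords}  # a set gives fast membership tests
    content = [
        w.lower() for w in words
        if w.isalpha() and w.lower() not in stop  # drops punctuation and stopwords
    ]
    return Counter(content).most_common(n)

# Tiny stand-in data (hypothetical); with NLTK you would pass
# corpus.words(fileids) and nltk.corpus.stopwords.words('english') instead.
sample_words = ["The", "cat", "and", "the", "dog", ",", "or", "the", "cat", "."]
sample_stopwords = ["the", "and", "or"]
top = top_content_words(sample_words, sample_stopwords, n=2)
print(top)
```

If you prefer to stay with FreqDist, nltk.FreqDist(filtered_text).most_common(10) should give the same kind of (word, count) list.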

Hope that helps.
