python - Adding stop_words when performing TF-IDF cosine similarity

Question

I am using sklearn to compute cosine similarity.

Is there a way to treat every word that starts with a capital letter as a stop word?

score 0 · Accepted Answer

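The following regular expression takes a string and replaces every alphanumeric sequence that starts with an uppercase letter with an empty string, which removes it. See http://docs.python.org/2.7/library/re.html for more options.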

import re
s1 = "The cat Went to The store To get Some food doNotMatch"
# \b anchors at a word boundary; \w* also matches one-letter words like "A".
r1 = re.compile(r'\b[A-Z]\w*')
r1.sub('', s1)
# ' cat  to  store  get  food doNotMatch'
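The substitution leaves doubled spaces where words were removed. If that matters for later processing, a minimal sketch of a helper that also tidies the whitespace could look like this. The name `strip_capitalized` is hypothetical, and `[A-Z]` only covers ASCII capitals, so accented or non-Latin uppercase letters would need a different pattern.

```python
import re

# Words that begin with an uppercase ASCII letter.
CAPITALIZED = re.compile(r"\b[A-Z]\w*")


def strip_capitalized(text):
    """Remove capitalized words and collapse the leftover whitespace."""
    without_caps = CAPITALIZED.sub("", text)
    # split() with no arguments drops empty strings, so runs of spaces vanish.
    return " ".join(without_caps.split())


cleaned = strip_capitalized("The cat Went to The store To get Some food doNotMatch")
print(cleaned)  # cat to store get food doNotMatch
```

`doNotMatch` survives because its capital letters are not at a word boundary.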

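Sklearn also has a number of powerful tools for text feature generation, for example the sklearn.feature_extraction.text module. You may also want to consider NLTK to help with sentence segmentation, stop-word removal, and so on.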
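One way to combine the regular expression with sklearn is to pass it as a custom `preprocessor` to `TfidfVectorizer`, so capitalized words are dropped before tokenization. This is only a sketch under assumptions: the helper `drop_capitalized` and the example documents are hypothetical and do not come from the question.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Words that begin with an uppercase ASCII letter.
CAPITALIZED = re.compile(r"\b[A-Z]\w*")


def drop_capitalized(text):
    """Remove capitalized words, then lowercase what is left."""
    return CAPITALIZED.sub("", text).lower()


# Hypothetical example documents, not taken from the original question.
docs = [
    "The cat Went to The store To get Some food",
    "A cat went to the store for food",
    "Dogs chase the ball in the park",
]

# A custom preprocessor replaces the built-in lowercasing step, so the
# capitalized words are stripped while case information is still available.
vectorizer = TfidfVectorizer(preprocessor=drop_capitalized)
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between all documents (a 3 x 3 matrix).
similarity = cosine_similarity(tfidf)
vocabulary = sorted(vectorizer.vocabulary_)
```

Because the filtering happens in the preprocessor, words such as "Some" and "Dogs" never reach the vocabulary, while lowercase occurrences of the same word (for example "the") are still counted.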
