Update
As of scikit-learn 0.14, the format has changed to:

```python
n_grams = CountVectorizer(ngram_range=(1, 5))
```
Full example:
```python
from sklearn.feature_extraction.text import CountVectorizer

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

c_vec = CountVectorizer(ngram_range=(1, 5))
# input to fit_transform() should be an iterable of strings
ngrams = c_vec.fit_transform([test_str1, test_str2])
# needs to happen after fit_transform()
vocab = c_vec.vocabulary_
count_values = ngrams.toarray().sum(axis=0)
# output n-grams, most frequent first
for ng_count, ng_text in sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True):
    print(ng_count, ng_text)
```
It outputs the following (note that the word I is removed not because it is a stopword (it isn't), but because of its length: https://stackoverflow.com/a/20743758/):
> (3, u'to')
> (3, u'from')
> (2, u'ngrams')
> (2, u'need')
> (1, u'words')
> (1, u'trigrams but need better solutions')
> (1, u'trigrams but need better')
...
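As a side note on the dropped I: `CountVectorizer`'s default `token_pattern` is `r"(?u)\b\w\w+\b"`, which only keeps tokens of two or more characters. If you want single-character tokens counted as well, one option (a sketch, not the only way) is to override the pattern:

```python
from sklearn.feature_extraction.text import CountVectorizer

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

# relax the default token pattern so tokens of length 1 are kept too
c_vec = CountVectorizer(ngram_range=(1, 5), token_pattern=r"(?u)\b\w+\b")
ngrams = c_vec.fit_transform([test_str1, test_str2])

# 'i' (lowercased by default) now appears in the vocabulary
print("i" in c_vec.vocabulary_)
```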
This should/could be much simpler these days, imo. You could try something like textacy, but that can sometimes come with complications of its own, such as initializing a Doc, which currently doesn't work with v0.6.2 as shown in their docs. If doc initialization worked as promised, the following would in theory work (but it doesn't):
```python
import textacy

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

# some version of the following line
doc = textacy.Doc([test_str1, test_str2])
ngrams = doc.to_bag_of_terms(ngrams={1, 5}, as_strings=True)
print(ngrams)
```
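If you just want word n-gram counts without any third-party library at all, a stdlib-only sketch (using `collections.Counter` and a simple `\w+` tokenizer, which tokenizes more crudely than scikit-learn does) could look like this:

```python
import re
from collections import Counter

def word_ngrams(text, n_min=1, n_max=5):
    """Yield all word n-grams of length n_min..n_max from a string."""
    words = re.findall(r"\w+", text.lower())
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

texts = [
    "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words.",
    "I know how to exclude bigrams from trigrams, but i need better solutions.",
]

counts = Counter()
for t in texts:
    counts.update(word_ngrams(t))

# most popular n-grams first
for ngram, count in counts.most_common(5):
    print(count, ngram)
```

Note that, unlike `CountVectorizer`, this keeps single-character tokens such as i, so the counts differ slightly from the scikit-learn output above.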
Old answer
WordNGramAnalyzer is indeed deprecated as of scikit-learn 0.11. Creating n-grams and counting term frequencies are now combined in sklearn.feature_extraction.text.CountVectorizer. You can create all n-grams ranging from 1 to 5 like this:
```python
n_grams = CountVectorizer(min_n=1, max_n=5)
```
More examples and information can be found in scikit-learn's documentation on text feature extraction.