I am trying to use SkLearn's TfidfVectorizer to extract a vocabulary of unigrams, bigrams, and trigrams. This is my current code:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

max_df_param = .003
use_idf = True

# Unigrams
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
unigrams = vectorizer.get_feature_names()

# Bigrams, capped at a tenth of the unigram count
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(2, 2), max_features=max(1, int(len(unigrams) / 10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
bigrams = vectorizer.get_feature_names()

# Trigrams, same cap
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(3, 3), max_features=max(1, int(len(unigrams) / 10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
trigrams = vectorizer.get_feature_names()

vocab = np.concatenate((unigrams, bigrams, trigrams))
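For reference, dataframe[column] is just a pandas Series of raw text. A self-contained toy version of the unigram step looks like this (hypothetical sample documents, with max_df left at its default, since .003 would filter every term out of a two-document corpus):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for my real data; any DataFrame with a text column behaves the same
dataframe = pd.DataFrame({'text': [
    "the 1960s saw 200a filings and 18th century revivals",
    "a 2d sketch beats a 3d model in 190 of 416 cases",
]})
column = 'text'

vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), use_idf=True)
X = vectorizer.fit_transform(dataframe[column])
print(vectorizer.get_feature_names())
# digit-containing terms such as '1960s', '18th', '2d', '190' appear in the vocabulary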
However, I want to exclude numbers and words that contain digits, and the current output includes terms such as "0 101 110 12 15th 16th 180c 180d 18th 190 1900 1960s 197 1980 1b 20 200 200a 2d 3d 416 4th 50 7a 7b".
I tried using the token_pattern parameter with the following regex to include only words made up of alphabetic characters:
vectorizer = TfidfVectorizer(max_df=max_df_param,
                             token_pattern=u'(?u)\b\^[A-Za-z]+$\b',
                             stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)
But this returns: ValueError: empty vocabulary; perhaps the documents only contain stop words
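For what it's worth, the error reproduces even on a trivial corpus with no stop words in it, so it seems tied to the pattern itself rather than to my data (a minimal sketch, assuming toy documents):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["alpha beta gamma", "delta epsilon zeta"]  # purely alphabetic, no stop words
vectorizer = TfidfVectorizer(token_pattern=u'(?u)\b\^[A-Za-z]+$\b')
vectorizer.fit_transform(docs)
# raises ValueError: empty vocabulary; perhaps the documents only contain stop words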
I have also tried removing only the digits, but I still get the same error.
Is my regex incorrect? Or am I using TfidfVectorizer incorrectly? (I have also tried removing the max_features parameter.)

Thanks!