I have a dataframe (data) containing 3 records:
id text
0001 The farmer plants grain
0002 The fisher catches tuna
0003 The police officer fights crime
I group the dataframe by id:
data_grouped = data.groupby('id')
Describing the resulting groupby object shows that all of the records are retained.
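For reference, a minimal sketch that reconstructs the example data and the grouping step (the column values are taken from the table above; the variable names match the question):

```python
import pandas as pd

# Rebuild the 3-record dataframe shown in the question
data = pd.DataFrame({
    'id': ['0001', '0002', '0003'],
    'text': ['The farmer plants grain',
             'The fisher catches tuna',
             'The police officer fights crime'],
})

# Group by id: each id holds exactly one row, and no rows are dropped
data_grouped = data.groupby('id')
print(data_grouped.size())
```

Checking `data_grouped.ngroups` and `data_grouped.size().sum()` confirms that all three records survive the grouping.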
Then I run this code to find the nGrams in text and join them to the id:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2),
                                  analyzer='word')

for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['text'])
    frequencies = sum(X).toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names())
    dfinner['id'] = id
    final = results.join(dfinner)
When I run all of this code together, word_vectorizer throws an error stating "empty vocabulary; perhaps the documents only contain stop words". I know this error is mentioned in many other questions, but I couldn't find one that deals with a DataFrame.
To complicate matters further, the error doesn't always appear. I pull the data from a SQL database, and depending on how many records I pull, the error may or may not occur. For example, pulling in the Top 10 records causes the error, but the Top 5 does not.
Edit:

Full traceback:
Traceback (most recent call last):

  File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
    runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
    X = word_vectorizer.fit_transform(group['cleanComments'])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"

ValueError: empty vocabulary; perhaps the documents only contain stop words