
I have a dataframe (data) containing 3 records:

id    text
0001  The farmer plants grain
0002  The fisher catches tuna
0003  The police officer fights crime

I group the dataframe by id:

data_grouped = data.groupby('id')

Describing the resulting groupby object shows that all of the records are preserved.
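A minimal sketch of that check, rebuilding the three-row frame from above for completeness:

import pandas as pd

data = pd.DataFrame({'id': ['0001', '0002', '0003'],
                     'text': ['The farmer plants grain',
                              'The fisher catches tuna',
                              'The police officer fights crime']})
data_grouped = data.groupby('id')

# Every id is unique, so each record should land in its own group:
print(data_grouped.ngroups)   # 3
print(data_grouped.size())    # one row per id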

Then I run this code to find the n-grams in text and join them to their id:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2),
                                  analyzer='word')
for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['text'])
    frequencies = sum(X).toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names())
    dfinner['id'] = id
    final = results.join(dfinner)

When I run all of this code together, word_vectorizer raises an error saying "empty vocabulary; perhaps the documents only contain stop words". I know this error comes up in many other questions, but I couldn't find one that deals with a DataFrame.

To complicate matters further, the error doesn't always appear. I pull the data from a SQL database, and depending on how many records I pull, the error may or may not occur. For example, pulling the Top 10 records triggers the error, but the Top 5 does not.
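One likely trigger, assuming some of the pulled rows contain empty or one-word text (which would explain why it depends on how many records are pulled): with ngram_range=(2,2) the vectorizer extracts only bigrams, so a document with fewer than two tokens produces no features at all. A minimal reproduction:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
# A one-token document yields no bigrams, so the vocabulary is empty and
# fit_transform raises the same ValueError shown in the traceback below:
cv.fit_transform(['grain'])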

Edit:

Full traceback:

Traceback (most recent call last):

  File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
    runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
    X = word_vectorizer.fit_transform(group['cleanComments'])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"

ValueError: empty vocabulary; perhaps the documents only contain stop words

1 Answer


I see what's going on here, but I have a nagging question: why are you doing this? I'm not sure I see the value of fitting a CountVectorizer to each individual document in a collection. The usual idea is to fit it to the whole corpus and analyze from there. I understand you may want to be able to see which grams occur in each document, but there are simpler, better-optimized ways to do that. For example:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['The farmer plants grain', 'The fisher catches tuna', 'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(df.text)
print(cv.get_feature_names())
['catches tuna',
 'farmer plants',
 'fights crime',
 'fisher catches',
 'officer fights',
 'plants grain',
 'police officer',
 'the farmer',
 'the fisher',
 'the police']
print(dt_mat.todense())
[[0 1 0 0 0 1 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 0]
 [0 0 1 0 1 0 1 0 0 1]]

Great: now you can see the features CountVectorizer extracted, plus a matrix representation of which features occur in each document. dt_mat is the document-term matrix; it holds the count (frequency) of each gram in the vocabulary (the features) for each document. To map this back to the grams, and even put it into a DataFrame, you can do the following:

df['grams'] = cv.inverse_transform(dt_mat)
print(df)
   id                             text  \
0   1          The farmer plants grain
1   2          The fisher catches tuna
2   3  The police officer fights crime

                                               grams
0          [plants grain, farmer plants, the farmer]
1         [catches tuna, fisher catches, the fisher]
2  [fights crime, officer fights, police officer,...

Personally, this feels more sensible, because you're fitting the CountVectorizer to the whole corpus rather than to one document at a time. You can still extract the same information (the frequencies and the grams), and it will be much faster as you scale up the number of documents.
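If you do still want the per-id gram/frequency table the question was building, here is a minimal sketch on top of a single corpus-wide fit (the freq and long_form names and the stack-based reshaping are just one illustrative way to do it):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['The farmer plants grain', 'The fisher catches tuna', 'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(df.text)

# Expand the document-term matrix into a dense frame indexed by id...
freq = pd.DataFrame(dt_mat.toarray(), columns=cv.get_feature_names(), index=df.id)
# ...then melt it into long (id, gram, frequency) form and drop zero counts.
long_form = freq.stack().rename('frequency').reset_index()
long_form.columns = ['id', 'gram', 'frequency']
long_form = long_form[long_form.frequency > 0]
print(long_form)

This gives one row per (id, gram) pair, which is the same join the original loop was attempting, but from a single fit.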

Answered 2017-05-12T17:51:12.773