
I have a dataframe (data) containing 3 records:

id    text
0001  The farmer plants grain
0002  The fisher catches tuna
0003  The police officer fights crime

I group the dataframe by id:

data_grouped = data.groupby('id')

Describing the resulting groupby object shows that all of the records are preserved.
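A minimal sketch of that check, rebuilding the three-row frame from above for completeness:

import pandas as pd

data = pd.DataFrame({'id': ['0001', '0002', '0003'],
                     'text': ['The farmer plants grain',
                              'The fisher catches tuna',
                              'The police officer fights crime']})
data_grouped = data.groupby('id')

# Every id is unique, so each record should land in its own group:
print(data_grouped.ngroups)   # 3
print(data_grouped.size())    # one row per id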

Then I run this code to find the n-grams in text and join them to their id:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2),
                                  analyzer='word')
for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['text'])
    frequencies = sum(X).toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names())
    dfinner['id'] = id
    final = results.join(dfinner)

When I run all of this code together, word_vectorizer raises an error saying "empty vocabulary; perhaps the documents only contain stop words". I know this error comes up in many other questions, but I couldn't find one that deals with a DataFrame.

To complicate matters further, the error doesn't always appear. I pull the data from a SQL database, and depending on how many records I pull, the error may or may not occur. For example, pulling the Top 10 records triggers the error, but the Top 5 does not.
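One likely trigger, assuming some of the pulled rows contain empty or one-word text (which would explain why it depends on how many records are pulled): with ngram_range=(2,2) the vectorizer extracts only bigrams, so a document with fewer than two tokens produces no features at all. A minimal reproduction:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
# A one-token document yields no bigrams, so the vocabulary is empty and
# fit_transform raises the same ValueError shown in the traceback below:
cv.fit_transform(['grain'])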

Edit:

Full traceback:

Traceback (most recent call last):

  File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
    runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
    X = word_vectorizer.fit_transform(group['cleanComments'])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"

ValueError: empty vocabulary; perhaps the documents only contain stop words

1 Answer


I see what's going on here, but I have a nagging question: why are you doing this? I'm not sure I see the value of fitting a CountVectorizer to each individual document in a collection. The usual idea is to fit it to the whole corpus and analyze from there. I understand you may want to be able to see which grams occur in each document, but there are simpler, better-optimized ways to do that. For example:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['The farmer plants grain', 'The fisher catches tuna', 'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(df.text)
print(cv.get_feature_names())
['catches tuna',
 'farmer plants',
 'fights crime',
 'fisher catches',
 'officer fights',
 'plants grain',
 'police officer',
 'the farmer',
 'the fisher',
 'the police']
print(dt_mat.todense())
[[0 1 0 0 0 1 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 0]
 [0 0 1 0 1 0 1 0 0 1]]

Great: now you can see the features CountVectorizer extracted, plus a matrix representation of which features occur in each document. dt_mat is the document-term matrix; it holds the count (frequency) of each gram in the vocabulary (the features) for each document. To map this back to the grams, and even put it into a DataFrame, you can do the following:

df['grams'] = cv.inverse_transform(dt_mat)
print(df)
   id                             text  \
0   1          The farmer plants grain
1   2          The fisher catches tuna
2   3  The police officer fights crime

                                               grams
0          [plants grain, farmer plants, the farmer]
1         [catches tuna, fisher catches, the fisher]
2  [fights crime, officer fights, police officer,...

Personally, this feels more sensible, because you're fitting the CountVectorizer to the whole corpus rather than to one document at a time. You can still extract the same information (the frequencies and the grams), and it will be much faster as you scale up the number of documents.
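If you do still want the per-id gram/frequency table the question was building, here is a minimal sketch on top of a single corpus-wide fit (the freq and long_form names and the stack-based reshaping are just one illustrative way to do it):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['The farmer plants grain', 'The fisher catches tuna', 'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(df.text)

# Expand the document-term matrix into a dense frame indexed by id...
freq = pd.DataFrame(dt_mat.toarray(), columns=cv.get_feature_names(), index=df.id)
# ...then melt it into long (id, gram, frequency) form and drop zero counts.
long_form = freq.stack().rename('frequency').reset_index()
long_form.columns = ['id', 'gram', 'frequency']
long_form = long_form[long_form.frequency > 0]
print(long_form)

This gives one row per (id, gram) pair, which is the same join the original loop was attempting, but from a single fit.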

Answered 2017-05-12T17:51:12.773