I have a CountVectorizer:
word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
Running the vectorizer:
X = word_vectorizer.fit_transform(group['cleanComments'])
raises this error:
Traceback (most recent call last):
File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
X = word_vectorizer.fit_transform(group['cleanComments'])
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
This error occurs when the document the n-grams are extracted from is the string "duplicate q". It happens whenever the document is "".
Why doesn't CountVectorizer treat q (or any single letter) as a valid word? Is there a comprehensive list anywhere of the possible causes of this error in CountVectorizer?
Edit: I dug into the error a bit more, and it looks like it's related to the vocabulary. I assume the default vocabulary doesn't accept single letters as words, but I'm not sure how to work around that.
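For what it's worth, my understanding is that the culprit is not the vocabulary itself but the default `token_pattern`, `r"(?u)\b\w\w+\b"`, which only matches tokens of two or more word characters, so a lone "q" is discarded before the vocabulary is ever built. A sketch of a possible workaround, overriding `token_pattern` to also keep single-character tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default token_pattern is r"(?u)\b\w\w+\b" (two+ word characters),
# which drops single-letter tokens like "q" during tokenization.
# r"(?u)\b\w+\b" keeps them.
word_vectorizer = CountVectorizer(
    stop_words=None,
    ngram_range=(2, 2),
    analyzer='word',
    token_pattern=r"(?u)\b\w+\b",
)

# With the relaxed pattern, "duplicate q" tokenizes to
# ['duplicate', 'q'] and yields the bigram 'duplicate q'.
X = word_vectorizer.fit_transform(["duplicate q"])
print(sorted(word_vectorizer.vocabulary_))  # ['duplicate q']
```

Note that truly empty documents still produce no tokens, so a corpus (or group) consisting only of "" would raise the same "empty vocabulary" error regardless of the pattern.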