To run an NB classifier on roughly 400 MB of text data, I need to use a vectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(X_data)
But it runs out of memory (full traceback below). I'm on 64-bit Linux with a 64-bit Python. How do people handle large text datasets when vectorizing with scikit-learn?
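For context, the usual memory-bounded route I've seen for vectorizing large corpora in scikit-learn is the stateless HashingVectorizer combined with out-of-core learning via partial_fit. A minimal sketch of what I understand that to look like (the chunk loader and label set here are assumptions for illustration, not part of my actual code):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

def iter_text_chunks():
    # Hypothetical loader: yields (documents, labels) batches read from
    # disk instead of holding the whole 400 MB corpus in memory.
    yield ["some text", "more text"], [0, 1]

# Stateless vectorizer: fixed memory footprint, no vocabulary to store.
# alternate_sign=False keeps features non-negative for MultinomialNB
# (older scikit-learn releases spell this non_negative=True).
vectorizer = HashingVectorizer(n_features=2 ** 20, alternate_sign=False)

classifier = MultinomialNB()
all_classes = [0, 1]  # assumption: the full label set is known up front

for docs, labels in iter_text_chunks():
    X = vectorizer.transform(docs)  # no fit needed, hashing is stateless
    classifier.partial_fit(X, labels, classes=all_classes)

That keeps both the vectorizer and the classifier within a fixed memory budget, but as the traceback below shows, the vectorizer may not be where my memory actually goes.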
Traceback (most recent call last):
  File "ParseData.py", line 234, in <module>
    main()
  File "ParseData.py", line 211, in main
    classifier = MultinomialNB().fit(X_train, y_train)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 313, in fit
    Y = labelbin.fit_transform(y)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
    neg_label=self.neg_label)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
    Y = np.zeros((len(y), len(classes)), dtype=np.int)
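Note where the last frame allocates: label_binarize builds a dense (len(y), len(classes)) integer array, so the sparse vectorizer output is not the problem; memory blows up when y_train contains many distinct values. A quick back-of-the-envelope check, with purely assumed sizes:

n_samples = 1000000  # assumption: rough number of documents in 400 MB of text
n_classes = 30000    # assumption: distinct values in y_train (e.g. labels read as raw strings)
bytes_needed = n_samples * n_classes * 8  # np.int is 8 bytes per entry on Linux64
print(bytes_needed / 1024.0 ** 3)  # ~223.5 GB: more than enough to OOM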
Edited (ogrisel): I changed the title from "Out of memory error in Scikit Vectorizer" to "Out of memory error in Scikit-learn MultinomialNB" to make it more descriptive of the actual problem.