python - Converting a TfidfVectorizer sparse matrix to a DataFrame or dense array causes a MemoryError

Question

My input is a pandas object ("vector") with a single column and 178886 rows (index 0 to 178885), each holding a string of up to 600 words.

0         this is an example text...
1         more examples...
          ...
178885    last example
Name: vectortext, Length: 178886, dtype: object

I am using TfidfVectorizer for feature extraction (unigrams):

vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
X = vectorizer_uni.fit_transform(vector).toarray()
X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams 
k = len(X.columns) #number of features

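Unfortunately, I get the MemoryError shown below. I am running 64-bit Python 3.6 on a Windows 10 machine with 16 GB of RAM. I have been looking into Python generators and similar techniques, but I don't see how to solve this without limiting the number of features (which is not really an option). Any ideas on how to solve this? Could I split my DataFrame beforehand somehow?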

Traceback:

---------------------------------------------------------------------------
 MemoryError                               Traceback (most recent call last)
       <ipython-input-88-15b6091ceec7> in <module>()
       1 vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
 ----> 2 X = vectorizer_uni.fit_transform(vector).toarray()
       3 X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams
       4 k = len(X.columns) # number of features

 C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
       962     def toarray(self, order=None, out=None):
       963         """See the docstring for `spmatrix.toarray`."""
   --> 964         return self.tocoo(copy=False).toarray(order=order, out=out)
       965 
       966     ##############################################################

 C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\coo.py in toarray(self, order, out)
       250     def toarray(self, order=None, out=None):
       251         """See the docstring for `spmatrix.toarray`."""
   --> 252         B = self._process_toarray_args(order, out)
       253         fortran = int(B.flags.f_contiguous)
       254         if not fortran and not B.flags.c_contiguous:

 C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
       1037             return out
       1038         else:
    -> 1039             return np.zeros(self.shape, dtype=self.dtype, order=order)
       1040 
       1041     def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):

 MemoryError:
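
The last frame of the traceback is the np.zeros(self.shape, ...) call, which is where .toarray() tries to allocate the full dense array. A rough estimate of that allocation can be sketched as follows. The vocabulary size is not stated in the question, so n_features is a made-up figure used only for illustration; the 8 bytes per value assumes TfidfVectorizer's default float64 output.

```python
# Back-of-the-envelope size of the dense array that .toarray() has to allocate.
n_rows = 178_886        # Series length shown in the question
n_features = 100_000    # ASSUMPTION: hypothetical vocabulary size, not given in the question
bytes_per_value = 8     # float64, the default dtype of TfidfVectorizer

dense_bytes = n_rows * n_features * bytes_per_value
dense_gib = dense_bytes / 1024**3

print(f"dense array: {dense_gib:.1f} GiB")
```

Under that assumed vocabulary size the dense array would be far larger than the 16 GB of RAM available, whatever the exact feature count turns out to be in practice.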

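
One possible direction, offered as a minimal sketch rather than a definitive answer, is to skip .toarray() and keep working with the SciPy sparse matrix that fit_transform returns. The three-row Series below is a hypothetical stand-in for the real data, and stop_words=stop is omitted because the stop list is not shown in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in for the "vector" Series from the question (hypothetical data).
vector = pd.Series(
    ["this is an example text", "more examples", "last example"],
    name="vectortext",
)

vectorizer_uni = TfidfVectorizer(ngram_range=(1, 1), use_idf=True, analyzer="word")
X = vectorizer_uni.fit_transform(vector)  # stays a SciPy sparse matrix, no .toarray()

k = X.shape[1]  # number of features, equal to len(vectorizer_uni.vocabulary_)

# Memory held by the CSR matrix versus what a dense copy would require.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
print(X.shape, sparse_bytes, dense_bytes)
```

Many scikit-learn estimators accept a sparse X directly, so the dense DataFrame may not be needed at all. If labelled columns are required, pandas 0.25 and later provide pd.DataFrame.sparse.from_spmatrix, which builds a sparse-backed DataFrame; whether that fits in memory for the full data set is something to verify on the real data.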