numpy - Handling a large number of unique words for text processing / tf-idf etc.

Question

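I am using scikit-learn to do some text processing, such as tf-idf. The number of files is handled fine (~40k). But with this many unique words I cannot work with the array/matrix, whether to print the number of unique words or to dump the numpy array to a file (using savetxt). The traceback is below. It would be enough to get only the top tf-idf values, since I don't need them for every word in every document. Alternatively, I could exclude additional words from the calculation (not stop words, but a separate set of words in a text file that I would supply). I don't know, though, whether removing those words would be enough to fix the problem. Finally, if I could somehow grab pieces of the matrix, that would work too. Any example of dealing with this kind of thing would be helpful and give me a starting point. (PS: I looked at and tried HashingVectorizer, but it seems I can't do tf-idf with it?)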

Traceback (most recent call last):
  File "/sklearn.py", line 40, in <module>
    array = X.toarray()
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 790, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 239, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 699, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
ValueError: array is too big.

Relevant code:

path = "/home/files/"

fh = open('output.txt','w')


filenames = os.listdir(path)

filenames.sort()

try:
    filenames.remove('.DS_Store')
except ValueError:
    pass # or scream: thing not in some_list!
except AttributeError:
    pass # call security, some_list not quacking like a list!

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english') 
X=vectorizer.fit_transform(filenames)
fh.write(str(vectorizer.vocabulary_))

array = X.toarray()
print array.size
print array.shape

Edit: in case this helps,

print 'Array is:' + str(X.get_shape()[0])  + ' by ' + str(X.get_shape()[1]) + ' matrix.'

prints the dimensions of the too-large sparse matrix, in my case:

Array is: 39436 by 113214 matrix.

score 1 · Accepted Answer

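The traceback holds the answer here: when you call X.toarray() at the end, it converts the sparse matrix representation to a dense one. This means that instead of storing only the nonzero counts (the words that actually occur in each document), you are now storing a value for every word in every document.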
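
To make the size concrete, here is a rough back-of-the-envelope estimate using the shape reported in the question. It assumes 8-byte integer entries (int64, which as far as I know is CountVectorizer's default dtype); the actual dtype on the asker's machine is not shown in the post.

```python
import numpy as np

# Shape reported in the question's edit.
n_docs, n_terms = 39436, 113214

# Assumption: 8-byte integer counts (int64).
itemsize = np.dtype(np.int64).itemsize

n_entries = n_docs * n_terms          # every word in every document
dense_bytes = n_entries * itemsize    # memory a dense array would need

print("%d entries, about %.1f GB dense" % (n_entries, dense_bytes / 1e9))
```

That is roughly 4.5 billion entries, or about 36 GB, which is why np.zeros gives up.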

Thankfully, most operations either work on sparse matrices directly or have a sparse variant. Just avoid calling .toarray() or .todense() and you'll be fine.
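
As a minimal sketch of staying sparse, the snippet below computes tf-idf and pulls the top-scoring terms per document without ever densifying the matrix. The toy in-memory corpus, the helper name top_terms, and the value of k are my own illustrative assumptions; the question reads files with input='filename' instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption); the question uses input='filename' on ~40k files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are common pets",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs).tocsr()  # sparse; never call .toarray() on it

# Shape and number of stored values are available without densifying.
print("%d by %d matrix, %d stored nonzeros" % (X.shape[0], X.shape[1], X.nnz))

# Column index -> term, built from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

def top_terms(row, k=2):
    """Return the k highest tf-idf (term, score) pairs of one sparse CSR row."""
    order = np.argsort(row.data)[::-1][:k]  # positions of the largest stored values
    return [(terms[row.indices[j]], float(row.data[j])) for j in order]

for i in range(X.shape[0]):
    print(i, top_terms(X[i]))
```

Only the stored nonzeros of each row are sorted, so the cost scales with the number of distinct words in a document rather than with the vocabulary size. To exclude your own word list, stop_words also accepts a list of strings.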

For more information, check out the scipy sparse matrix documentation.
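
For the other two things the question asks about, writing the matrix to disk and grabbing pieces of it, scipy.sparse can do both while staying sparse. This is only a sketch: the random matrix stands in for the real document-term matrix, and the sizes and temporary file path are made up for illustration.

```python
import os
import tempfile
from scipy import sparse

# Random sparse matrix standing in for the document-term matrix (assumption).
X = sparse.random(1000, 5000, density=0.001, format="csr", random_state=0)

# Save in scipy's compressed .npz format instead of numpy.savetxt on a dense array.
path = os.path.join(tempfile.mkdtemp(), "X.npz")
sparse.save_npz(path, X)
X_loaded = sparse.load_npz(path)

# Row slicing on CSR is cheap and the result is still sparse.
chunk = X_loaded[:100]

# Densify only a piece small enough to fit in memory.
small_dense = chunk[:, :50].toarray()
print(chunk.shape, small_dense.shape)
```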
