scipy - Truncated SVD on TF-IDF gives ValueError: array is too big

Question

I am trying to apply TruncatedSVD.fit_transform() to the sparse matrix produced by TfidfVectorizer in scikit-learn, and it fails with the error shown after the code:

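    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    # `text` is an iterable of raw documents; n_iter was named n_iterations in older scikit-learn releases
    tsv = TruncatedSVD(n_components=10000, algorithm='randomized', n_iter=5)
    tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True)
    tfv.fit(text)
    text = tfv.transform(text)  # sparse TF-IDF matrix
    tsv.fit(text)               # this call raises the error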

    ValueError: array is too big

What other methods of dimensionality reduction can I use?

score 4 · Accepted Answer

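I'm pretty sure the problem is:

    tsv = TruncatedSVD(n_components=10000...

You have 10000 components in your SVD. If you have an m x n data matrix, the SVD will produce matrices of dimensions m x n_components and n_components x n, and these will be dense even if the data is sparse. Those matrices are probably too big.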

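I copied your code and ran it on the Kaggle Hashtag data (which I think is where this comes from), and with 300 components Python used up to 1 GB. With 10000 components you would be using roughly 30 times that.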
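To make the size argument concrete, here is a rough back-of-the-envelope sketch of how much memory the two dense factors need. The matrix shape used here (m documents, n features) is a made-up example and not the asker's real data, and the estimate assumes 8-byte floats and ignores any temporary arrays the solver allocates.

```python
def dense_svd_factor_bytes(m, n, n_components, itemsize=8):
    """Bytes needed for the two dense factors: (m x k) and (k x n)."""
    return (m * n_components + n_components * n) * itemsize


# Hypothetical shape: 100,000 documents and 500,000 n-gram features.
m, n = 100_000, 500_000

for k in (300, 10000):
    gib = dense_svd_factor_bytes(m, n, k) / 2**30
    print(f"n_components={k}: about {gib:.1f} GiB for the dense factors")
```

The estimate grows linearly with n_components, so going from 300 to 10000 multiplies it by about 33, which is in line with the "roughly 30 times" figure.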

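By the way, what you are doing here is latent semantic analysis, and that is unlikely to benefit from so many components. Somewhere in the 50-300 range should capture everything important.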
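A minimal sketch of latent semantic analysis with a small number of components might look like the following. The toy corpus is invented for illustration, and n_components=2 is used only because the corpus is tiny; on a real corpus you would pick a value in the 50-300 range suggested here.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
]

# TF-IDF followed by truncated SVD is the usual LSA recipe.
lsa = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    TruncatedSVD(n_components=2, random_state=0),
)

reduced = lsa.fit_transform(docs)  # dense array, one row per document
print(reduced.shape)
```

The output has one row per document and n_components columns, so its size no longer depends on the vocabulary size.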

score 0 · Accepted Answer

You may be getting this error because you are using 32-bit Python; try switching to a 64-bit build. Another approach to dimensionality reduction for sparse matrices is RandomizedPCA, which is PCA computed with randomized SVD.
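If you are not sure which build you are running, a quick check using only the standard library could look like this. The helper name python_bits is just an illustrative choice.

```python
import struct
import sys


def python_bits():
    """Return the pointer width of the running interpreter in bits."""
    return struct.calcsize("P") * 8


print(f"{python_bits()}-bit Python")
print("64-bit address space:", sys.maxsize > 2**32)
```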
