python - Hstacking features somehow causes extra slowdown in prediction

Question

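When I use scipy.sparse.hstack to combine sparse matrices produced by CountVectorizer and the like, so that I can use them together for regression, the result is somehow much slower: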

X1 has 10000 features from analyzer="char"
X2 has 10000 features from analyzer="word"
X3 has 20000 features from analyzer="char"
X4 has 20000 features from analyzer="word"

You would expect that hstacking X1 and X2 would be roughly as fast as X3 or X4 alone (same number of features). But it is not even close:

>>> from scipy.sparse import hstack
>>> from sklearn import linear_model
>>> a=linear_model.Ridge(alpha=30).fit(hstack((X1, X2)),y).predict(hstack((t1,t2)))
time:  57.85
>>> b=linear_model.Ridge(alpha=30).fit(X1,y).predict(t1)
time:  6.75
>>> c=linear_model.Ridge(alpha=30).fit(X2,y).predict(t2)
time:  7.33
>>> d=linear_model.Ridge(alpha=30).fit(X3,y).predict(t3)
time:  6.80
>>> e=linear_model.Ridge(alpha=30).fit(X4,y).predict(t4)
time:  11.67

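I even noticed that when I hstack just a single feature, the model also becomes slower. What could cause this, what am I doing wrong, and of course, how can it be improved?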

Notable edit:

I want to present an approach that I thought would solve it, namely building a combined vocabulary and fitting with that:

feats = []
method = CountVectorizer(analyzer="word", max_features=10000, ngram_range=(1,3))
method.fit(train["tweet"])
X = method.fit(...)
feats.extend(method.vocabulary_.keys())
method = CountVectorizer(analyzer="char", max_features=10000, ngram_range=(4,4))
method.fit(train["tweet"])
X2 = method.fit(...)
feats.extend(method.vocabulary_.keys())
newm = CountVectorizer(vocabulary=feats)
newm.fit(train["tweet"])
X3 = newm.fit(...)

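When I fit these, something strange happens with the number of stored elements (I am not surprised that there are fewer than 20,000 features, since there may be overlap). How come there are so few "ones" in X3?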

X
<49884x10000 sparse matrix of type '<class 'numpy.int64'>'
    with 927131 stored elements in Compressed Sparse Row format>
X2
<49884x10000 sparse matrix of type '<class 'numpy.int64'>'
    with 3256162 stored elements in Compressed Sparse Row format>
X3
<49884x19558 sparse matrix of type '<class 'numpy.int64'>'
    with 593712 stored elements in Compressed Sparse Row format>

score 3 · Accepted Answer

hstack converts the result to COO format:

>>> hstack((csr_matrix([1]), csr_matrix([2])))
<1x2 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in COOrdinate format>

Perhaps try hstack(...).tocsr() and check whether that speeds things up.
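A minimal sketch of that check, using random matrices as stand-ins for X1 and X2 (the shapes and densities here are made up for illustration). Note that which format hstack returns by default may depend on the SciPy version, so the sketch prints it rather than assuming COO; asking for CSR explicitly works either way.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the vectorizer outputs (sizes are assumptions).
X1 = sparse.random(1000, 200, density=0.01, format="csr", random_state=0)
X2 = sparse.random(1000, 300, density=0.01, format="csr", random_state=1)

# Default result: inspect the format instead of assuming it.
stacked = sparse.hstack((X1, X2))
print("default format:", stacked.format)

# Two ways to make sure the estimator receives CSR.
stacked_csr = sparse.hstack((X1, X2)).tocsr()
stacked_csr2 = sparse.hstack((X1, X2), format="csr")
print("converted format:", stacked_csr.format)
```

Passing the CSR result to fit and predict avoids leaving any format conversion to the estimator.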

score 1 · Accepted Answer

You can hstack two CSC matrices quite easily while keeping the output in CSC:

In [1]: import numpy as np
   ...: import scipy.sparse as sps

In [2]: a = sps.csc_matrix(np.arange(25).reshape(5, 5))

In [3]: b = sps.csc_matrix(np.arange(25).reshape(5, 5))

In [4]: data = np.concatenate((a.data, b.data))

In [5]: indices = np.concatenate((a.indices, b.indices))

In [7]: indptr = np.concatenate((a.indptr[:-1], b.indptr + a.indptr[-1]))


In [10]: c = sps.csc_matrix((data, indices, indptr),
...                         shape = (a.shape[0], a.shape[1]+b.shape[1]))

In [11]: c.toarray()
Out[11]: 
array([[ 0,  1,  2,  3,  4,  0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 20, 21, 22, 23, 24]])

The exact same code, with csc replaced by csr everywhere, will vstack two CSR matrices.
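The session above can be wrapped into two small helpers, sketched below. The names hstack_csc and vstack_csr are hypothetical (they are not SciPy functions), and the shape checks are an addition for safety.

```python
import numpy as np
import scipy.sparse as sps


def hstack_csc(a, b):
    """Horizontally stack two sparse matrices, returning CSC."""
    a, b = sps.csc_matrix(a), sps.csc_matrix(b)
    if a.shape[0] != b.shape[0]:
        raise ValueError("row counts must match for hstack")
    data = np.concatenate((a.data, b.data))
    indices = np.concatenate((a.indices, b.indices))
    # Drop a's final pointer and shift b's pointers past a's stored elements.
    indptr = np.concatenate((a.indptr[:-1], b.indptr + a.indptr[-1]))
    return sps.csc_matrix((data, indices, indptr),
                          shape=(a.shape[0], a.shape[1] + b.shape[1]))


def vstack_csr(a, b):
    """Vertically stack two sparse matrices, returning CSR."""
    a, b = sps.csr_matrix(a), sps.csr_matrix(b)
    if a.shape[1] != b.shape[1]:
        raise ValueError("column counts must match for vstack")
    data = np.concatenate((a.data, b.data))
    indices = np.concatenate((a.indices, b.indices))
    indptr = np.concatenate((a.indptr[:-1], b.indptr + a.indptr[-1]))
    return sps.csr_matrix((data, indices, indptr),
                          shape=(a.shape[0] + b.shape[0], a.shape[1]))


a = sps.csc_matrix(np.arange(25).reshape(5, 5))
b = sps.csc_matrix(np.arange(25).reshape(5, 5))
h = hstack_csc(a, b)  # shape (5, 10), CSC
v = vstack_csr(a, b)  # shape (10, 5), CSR
```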

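You will need to do some timings, but in most cases I believe it will be faster to convert your matrices to CSR or CSC, depending on which stacking you want to perform, stack them as above, and then convert the result to whatever format you want, rather than using the built-in stacking functions.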
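One way to run such a timing is sketched below with timeit. The matrix sizes and the helper name manual_hstack_csc are assumptions for illustration, and the outcome will vary with SciPy version and data, so no numbers are claimed here.

```python
import timeit

import numpy as np
import scipy.sparse as sps


def manual_hstack_csc(a, b):
    """Concatenate the CSC arrays of a and b directly (hypothetical helper)."""
    data = np.concatenate((a.data, b.data))
    indices = np.concatenate((a.indices, b.indices))
    indptr = np.concatenate((a.indptr[:-1], b.indptr + a.indptr[-1]))
    return sps.csc_matrix((data, indices, indptr),
                          shape=(a.shape[0], a.shape[1] + b.shape[1]))


# Random CSR inputs standing in for real feature matrices.
a = sps.random(2000, 500, density=0.005, format="csr", random_state=0)
b = sps.random(2000, 500, density=0.005, format="csr", random_state=1)


def builtin():
    # Built-in stacking, then convert to CSR.
    return sps.hstack((a, b)).tocsr()


def manual():
    # Convert to CSC, stack by hand, then convert to CSR.
    return manual_hstack_csc(a.tocsc(), b.tocsc()).tocsr()


t_builtin = min(timeit.repeat(builtin, number=5, repeat=3))
t_manual = min(timeit.repeat(manual, number=5, repeat=3))
print(f"built-in hstack: {t_builtin:.4f}s  manual CSC stacking: {t_manual:.4f}s")
```

Both paths include the format conversions, so the comparison covers the full cost of getting from CSR inputs to a CSR result.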
