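Basically, I am just trying to do a simple matrix multiplication; specifically, I want to extract each column of the matrix and normalize it by dividing it by its length.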

    #csc sparse matrix
    self.__WeightMatrix__ = self.__WeightMatrix__.tocsc()
    #iterate through columns
    for Col in xrange(self.__WeightMatrix__.shape[1]):
       Column = self.__WeightMatrix__[:,Col].data
       List = [x**2 for x in Column]
       #get the column length
       Len = math.sqrt(sum(List))
       #here I assumed dot(number,Column) would do a basic scalar product
       dot((1/Len),Column)
       #now what? How do I update the original column of the matrix? Everything returned so far is a copy, which drove me nuts and made me miss pointers

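I searched the scipy sparse matrix documentation and did not find anything useful. I was hoping for a function that returns a pointer/reference to the matrix so that I can modify its values directly. Thanks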

1 Answer

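In CSC format you have two writable attributes, data and indices, which hold the non-zero entries of the matrix and their corresponding row indices. You can use these to your advantage as follows: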

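import numpy as np

def sparse_row_normalize(sps_mat):
    # Scale every row of a CSC matrix to unit L2 norm, in place.
    if sps_mat.format != 'csc':
        msg = 'Can only row-normalize in place with csc format, not {0}.'
        msg = msg.format(sps_mat.format)
        raise ValueError(msg)
    # indices holds the row index of every stored entry, so bincount
    # accumulates the squared entries row by row.
    row_norm = np.sqrt(np.bincount(sps_mat.indices, weights=sps_mat.data * sps_mat.data))
    # Divide each stored entry by the norm of its row, modifying data in place.
    sps_mat.data /= np.take(row_norm, sps_mat.indices)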
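
The function above normalizes rows, while the question asks about columns. The same trick carries over if you first recover the column index of every stored entry from indptr. The following is only a sketch: the name `sparse_col_normalize` is made up for illustration, and it assumes a floating-point CSC matrix whose data array holds exactly the stored entries, with no column consisting solely of explicitly stored zeros.

```python
import numpy as np
import scipy.sparse as sp

def sparse_col_normalize(sps_mat):
    # Hypothetical column counterpart of the row version (not from the original answer).
    if sps_mat.format != 'csc':
        raise ValueError('Expected csc format, not {0}.'.format(sps_mat.format))
    n_cols = sps_mat.shape[1]
    # In CSC, column j owns data[indptr[j]:indptr[j + 1]], so the
    # differences of indptr give the number of stored entries per column.
    counts = np.diff(sps_mat.indptr)
    # Column index of every stored entry, in the same order as data.
    col_of_entry = np.repeat(np.arange(n_cols), counts)
    col_norm = np.sqrt(np.bincount(col_of_entry,
                                   weights=sps_mat.data * sps_mat.data,
                                   minlength=n_cols))
    # In-place division; empty columns are never indexed, so they stay empty.
    sps_mat.data /= col_norm[col_of_entry]

# Small usage example with an empty middle column.
col_demo = sp.csc_matrix(np.array([[3.0, 0.0, 0.0],
                                   [4.0, 0.0, 2.0],
                                   [0.0, 0.0, 0.0]]))
sparse_col_normalize(col_demo)
```

Passing `minlength` keeps `col_norm` the same length as the number of columns even when trailing columns are empty.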

To see that it actually works:

>>> mat = scipy.sparse.rand(4, 4, density=0.5, format='csc')
>>> mat.toarray()
array([[ 0.        ,  0.        ,  0.58931687,  0.31070526],
       [ 0.24024639,  0.02767106,  0.22635696,  0.85971295],
       [ 0.        ,  0.        ,  0.13613897,  0.        ],
       [ 0.        ,  0.13766507,  0.        ,  0.        ]])
>>> mat.toarray() / np.sqrt(np.sum(mat.toarray()**2, axis=1))[:, None]
array([[ 0.        ,  0.        ,  0.88458487,  0.46637926],
       [ 0.26076366,  0.03003419,  0.24568806,  0.93313324],
       [ 0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ]])
>>> sparse_row_normalize(mat)
>>> mat.toarray()
array([[ 0.        ,  0.        ,  0.88458487,  0.46637926],
       [ 0.26076366,  0.03003419,  0.24568806,  0.93313324],
       [ 0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ]])
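
If you would rather not compare against a dense copy by eye, one way to cross-check the result is an independent, non-in-place formulation: left-multiply by a diagonal matrix of reciprocal row norms. This is a different technique from the in-place approach, shown here only as a sketch; `row_normalize_copy` is a hypothetical helper name, and the handling of empty rows (left as all zeros) is an assumption of this example.

```python
import numpy as np
import scipy.sparse as sp

def row_normalize_copy(sps_mat):
    # Hypothetical helper: returns a new CSC matrix with unit-norm rows.
    # Squared L2 norm of each row, flattened to a 1-D array.
    sq_norms = np.asarray(sps_mat.multiply(sps_mat).sum(axis=1)).ravel()
    norms = np.sqrt(sq_norms)
    # Reciprocal norms, leaving empty rows at zero to avoid dividing by zero.
    inv = np.zeros_like(norms, dtype=float)
    nonzero = norms > 0
    inv[nonzero] = 1.0 / norms[nonzero]
    # diag(inv) @ A scales row i of A by inv[i].
    return (sp.diags(inv) @ sps_mat).tocsc()

check_in = sp.csc_matrix(np.array([[3.0, 4.0, 0.0],
                                   [0.0, 0.0, 0.0],
                                   [0.0, 5.0, 0.0]]))
check_out = row_normalize_copy(check_in)
```

This allocates a new matrix, so it trades the in-place update for a formulation that is easy to verify with `np.allclose` against either result.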

And it is also quite fast, with no Python loops to spoil the fun:

In [2]: mat = scipy.sparse.rand(10000, 10000, density=0.005, format='csc')

In [3]: mat
Out[3]: 
<10000x10000 sparse matrix of type '<type 'numpy.float64'>'
    with 500000 stored elements in Compressed Sparse Column format>

In [4]: %timeit sparse_row_normalize(mat)
100 loops, best of 3: 14.1 ms per loop
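
On the pointer/reference the question asks for: slicing the matrix itself returns a copy, but a basic slice of the data array is a NumPy view, so writing to it changes the matrix. If you do want to keep the per-column loop from the question, a minimal sketch could look like this (the variable names are illustrative, and it assumes a float CSC matrix):

```python
import numpy as np
import scipy.sparse as sp

loop_mat = sp.csc_matrix(np.array([[3.0, 0.0],
                                   [4.0, 2.0]]))

for j in range(loop_mat.shape[1]):
    # Column j owns the stored entries data[indptr[j]:indptr[j + 1]].
    start, end = loop_mat.indptr[j], loop_mat.indptr[j + 1]
    column = loop_mat.data[start:end]  # a view into data, not a copy
    length = np.sqrt(np.dot(column, column))
    if length > 0:
        # In-place division writes straight through to loop_mat.data.
        column /= length
```

This is slower than the vectorized version because of the Python loop, but it shows that no special reference-returning function is needed.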
Answered 2013-03-04T08:21:55.847