I have a 24000 * 316 numpy matrix in which each row represents a time series with 316 time points, and I am computing the Pearson correlation between each pair of these time series, so the result is a 24000 * 24000 numpy matrix of Pearson values. My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works, though still slowly. Is it expected to be this slow (it takes more than a day!), and what might I be able to do about it? If it helps, this is my code... nothing special or hard:

import numpy
import scipy.stats

def SimMat(mat, name):

    mrange = mat.shape[0]   # number of time series (rows)
    print("mrange:", mrange)
    nTRs = mat.shape[1]     # number of time points per series
    print("nTRs:", nTRs)

    SimM = numpy.zeros((mrange, mrange))

    # every series is perfectly correlated with itself
    for i in range(mrange):
        SimM[i][i] = 1

    # fill both triangles pair by pair, keeping only significant
    # (p <= 0.05) correlations of at least 0.5
    for i in range(mrange):
        for j in range(i + 1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])

            if pearV[1] <= 0.05:
                if pearV[0] >= 0.5:
                    print("Pearson value:", pearV[0])
                    SimM[i][j] = SimM[j][i] = pearV[0]
            else:
                SimM[i][j] = SimM[j][i] = 0

    numpy.savetxt(name, SimM)

    return SimM, nTRs

Thanks

1 Answer

The main problem with your implementation is the amount of memory you will need to store the correlation coefficients: a 24000 * 24000 matrix of doubles takes 24000 * 24000 * 8 bytes, at least 4.5 GB. There is no reason to keep the already-computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results, since they play nicely with numpy. Here is a complete, minimal working example:

import numpy as np
import h5py
from scipy.stats import pearsonr

# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()

# Compute Pearson correlations one row at a time, streaming each
# row of coefficients straight to disk instead of holding them all
# in memory
h5 = h5py.File("data.h5",'r+')
A  = h5["test"][:]

N   = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)

for i in range(N):
    out[i] = [pearsonr(A[i],A[j])[0] for j in range(N)]

h5.close()
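
Because the full matrix lives on disk, you can later read back just the rows you need without loading all ~4.6 GB at once; h5py datasets support numpy-style slicing. A quick sketch, assuming the same file and dataset names as above:

import h5py

# Open the file read-only and slice rows straight off disk;
# only the requested rows are loaded into memory.
h5 = h5py.File("data.h5", "r")
row0  = h5["pearson"][0]        # correlations of series 0 vs. all others
block = h5["pearson"][100:200]  # rows 100..199 as a (100, 24000) array
h5.close()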

Testing on the first 100 rows suggests that this would take only 8 hours on a single core. If you parallelize it, it should speed up linearly with the number of cores.
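
For the parallel version, one possible minimal sketch uses the standard multiprocessing module; the chunk size and the init/pearson_rows helpers here are illustrative choices, not a tuned implementation:

import numpy as np
import h5py
from multiprocessing import Pool
from scipy.stats import pearsonr

A = None  # set in each worker by init()

def init(data):
    global A
    A = data

def pearson_rows(bounds):
    # Compute one horizontal block of rows of the correlation matrix.
    start, stop = bounds
    N = A.shape[0]
    return start, np.array([[pearsonr(A[i], A[j])[0] for j in range(N)]
                            for i in range(start, stop)])

if __name__ == "__main__":
    h5 = h5py.File("data.h5", "r+")
    data = h5["test"][:]
    N = data.shape[0]
    out = h5.require_dataset("pearson", shape=(N, N), dtype=float)

    chunk = 100  # rows per worker task; purely illustrative
    bounds = [(s, min(s + chunk, N)) for s in range(0, N, chunk)]

    # Workers receive the data once via the initializer and compute
    # row blocks in parallel; only the parent writes to the hdf5 file.
    with Pool(initializer=init, initargs=(data,)) as pool:
        for start, block in pool.imap_unordered(pearson_rows, bounds):
            out[start:start + block.shape[0]] = block

    h5.close()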

answered 2015-05-07T19:34:31.617