r - mapply over the columns of multiple datasets

Question

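I have a function that takes two vectors and computes a single numeric value (as cor does). However, I have two datasets with about 6000 columns each (both have the same dimensions), and the function should return a vector of correlation values, one for each pair of matching columns.

The code with a loop looks like this: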

set.seed(123)
m=matrix(rnorm(9),ncol=3)
n=matrix(rnorm(9,10),ncol=3)

colNumber=dim(m)[2]
ReturnData=rep(NA,colNumber)

for (i in 1:colNumber){
    ReturnData[i]=cor(m[,i],n[,i])
}

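This works fine, but for efficiency reasons I would like to use the apply family, most obviously the mapply function.

However, mapply(cor, m, n) returns a vector of NA values of length 9, whereas it should return: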

> ReturnData
[1]  0.1247039 -0.9641188  0.5081204

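Edit/Solution

The solution given by @akrun is to use data frames instead of matrices.

Furthermore, a speed test of the two proposed solutions showed that the mapply version is slightly slower than the sapply version (about 4% in this run):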

require(rbenchmark) 
set.seed(123)
#initiate the two dataframes for the comparison 
m=data.frame(matrix(rnorm(10^6),ncol=100))
n=data.frame(matrix(rnorm(10^6),ncol=100))
#indx is needed for the sapply function to get the column numbers
indx=seq_len(ncol(m)) 
benchmark(s1=mapply(cor, m,n), s2=sapply(indx, function(i) cor(m[,i], n[,i])), order="elapsed", replications=100)

#  test replications elapsed relative user.self sys.self user.child sys.child
#2   s2          100    4.16    1.000      4.15        0         NA        NA
#1   s1          100    4.33    1.041      4.32        0         NA        NA

score 1 · Accepted Answer

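Because your datasets are matrices, mapply iterates over each element rather than over each column. To avoid this, convert them to data frames. I am not sure how efficient this is for large datasets.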

mapply(cor, as.data.frame(m), as.data.frame(n))
#     V1         V2         V3 
#0.1247039 -0.9641188  0.5081204
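
To illustrate why the conversion matters, here is a minimal sketch using a toy matrix x (a hypothetical example, not data from the question). A matrix is an atomic vector with a dim attribute, so mapply sees one element per call, and cor on two single values gives NA. A data frame is a list of columns, so mapply sees one column per call.

```r
# Toy matrix; the values are arbitrary and only used for illustration.
x <- matrix(1:6, ncol = 3)

length(x)                 # 6: an atomic vector with a dim attribute
length(as.data.frame(x))  # 3: a list with one entry per column

# Record how many values each call to the function receives.
per_element <- mapply(function(a, b) length(a), x, x)
per_column  <- mapply(function(a, b) length(a),
                      as.data.frame(x), as.data.frame(x))

per_element  # six calls, one value each
per_column   # three calls, one full column (two values) each
```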

Another option is to use sapply without converting to data.frame:

 indx <- seq_len(ncol(m))
 sapply(indx, function(i) cor(m[,i], n[,i]))
 #[1]  0.1247039 -0.9641188  0.5081204
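
If speed on roughly 6000 columns is the main concern, a fully vectorised variant may also be worth trying. This is a different technique from the two above, offered only as a sketch: it standardises every column with scale and sums the column-wise products, which matches the Pearson correlation that cor computes by default. It assumes there are no missing values and no constant columns, and the name col_cor is made up for this example.

```r
set.seed(123)
m <- matrix(rnorm(9), ncol = 3)
n <- matrix(rnorm(9, 10), ncol = 3)

# Standardise each column (mean 0, sd 1), multiply element-wise, and
# divide the column sums by n - 1, the denominator cor() also uses.
col_cor <- colSums(scale(m) * scale(n)) / (nrow(m) - 1)

# Cross-check against the sapply version from the answer.
reference <- sapply(seq_len(ncol(m)), function(i) cor(m[, i], n[, i]))
all.equal(unname(col_cor), reference)
```

Whether this is actually faster than sapply on your data should be measured, for example with the rbenchmark setup shown in the question.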
