r - R中的引导变量相关性

Question

我的目的是编写几个函数，旨在找出两个协方差矩阵之间的总体相似性，或者通过将它们与随机向量相乘并关联响应向量，或者通过引导其中一个矩阵来获得可用于比较的相关系数分布。但在这两种情况下，我都得到了错误的结果。观察到的矩阵间相关性高达 0.93，但分布范围最高仅达到 0.2。这是函数的代码：

resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)                           
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution", xlim = c(0, statObs+0.1), ylim = c(0,14))
    points(density(statSim), type="l", lwd=2)
    abline(v = statObs)
    text(10, 10, "observed corelation = ")
  }
  list( obs = statObs , sumFit = sum(statSim > statObs)/numR)
}

事实上，我很难相信两个原始矩阵之间的相关系数很高，并且第一个原始矩阵和第二个重新采样的矩阵之间的相关系数在 10000 次引导重复后最大为 0.2。

对代码的有效性有何评论？

score 2 · Accepted Answer

Sorry, I am not enough educated to catch up your goals about checking the correlation efficient between two covariance matrices, but I tried to apprehend your code per se.

If I am right, you are making up 10.000 different matrices from the same matrix (mat2) by reordering the cells all round, and recomputing the correlation between the covariance matrix of mat1 and the covariance matrix of the resampled array. Those are stored in the statSim variable.

You said the original correaltion efficient was high (statObs), but the maximum of statSim was low, which is strange. I think the problem is with your result list:

list( obs = statObs , sumFit = sum(statSim > statObs)/numR)

Where you return the original correaltion coefficient (obs), but not the written maximum with sumFit. There you might use eg. max(statSim). I see the point in returning sumFit for checking if the resampling did any improvement to the correlation efficient, but based on your code, I see no problem about the theory.

Updated function with max of simulated correlation coefficients:

resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)                           
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution", xlim = c(0, statObs+0.1), ylim = c(0,14))
    points(density(statSim), type="l", lwd=2)
    abline(v = statObs)
    text(10, 10, "observed corelation = ")
  }
  list( obs = statObs , sumFit = sum(statSim > statObs)/numR, max=max(statSim))
}

What I had run:

> mat1 <- matrix(runif(25),5,5)
> mat2 <- mat1+0.2
> resamplerSimAlt(mat1, mat2, 10000)
$obs
[1] 1

$sumFit
[1] 0

$max
[1] 0.94463

And with random mat2:

> mat2 <- matrix(runif(25),5,5)
> resamplerSimAlt(mat1, mat2, 10000)
$obs
[1] 0.31144

$sumFit
[1] 0.9124

$max
[1] 0.9231

My answer might not be a real answer. If that would be the case, please give more details about the problem :)

r - R中的引导变量相关性

1 回答 1

Related

Reference