r - PCoA error in R due to a large data set

Question

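For a project at work I have to perform a PCoA (principal coordinates analysis, also known as multidimensional scaling). However, I have run into some problems when doing this analysis in R.

The function cmdscale only accepts a matrix or a dist object as input, and the dist function gives this error: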

Error: cannot allocate vector of size 4.2 Gb
In addition: Warning messages:
1: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
2: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
3: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)
4: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) :
  Reached total allocation of 4020Mb: see help(memory.size)

When I use matrix instead, it changes the input into:

     [,1]         
[1,] Integer,33741
[2,] Integer,33741

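I cannot post the contents of the data set online, but I can give you its dimensions: it is 33741 rows long and 11 columns wide. The first column is the ID and the other 10 values are the ones needed for the PCoA.

As you can see in the error, I used only 2 columns and already got a memory error.

Now my questions are:
Is it possible to work around the memory limit of the dist function for this data?
What am I doing wrong with the matrix function, such that it turns the vectors into the 2-row output shown above?

What I have tried: clearing memory with garbage collection, restarting the GUI, restarting the system.

System: Windows 7 x64, i7 920qm 1.8ghz, 4GB DDR3 RAM

Code used: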

mydata <- read.table(file, header=TRUE)

mydist <- dist(mydata[c(3,4)], method="euclidian", diag=FALSE, upper=FALSE)
mymatrix <- matrix(mydata[c(3,4)], byrow=FALSE)
mymatrix <- matrix(cbind(mydata[c(3,4)]))

mycmdscale <- cmdscale(mydist, k=2, eig=FALSE, add=FALSE, x.ret=FALSE)
mycmdscale <- cmdscale(mymatrix, k=2, eig=FALSE, add=FALSE, x.ret=FALSE)

plot(mycmdscale)

Of course I did not run the code in this order, but it contains all the ways I tried to load the data.

Thanks in advance for any replies.

score 1 · Accepted Answer

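I know this is old, but I thought I would throw in what I've got...

I'm a bit surprised @Gavin Simpson didn't mention that computing a principal coordinates analysis on a Euclidean distance matrix is the same as a principal component analysis (at least when both use scaling=1).

This is according to p. 143 in Borcard, D., Gillet, F., & Legendre, P. (2011). Chapter 5: Unconstrained Ordination (pp. 115-151). In Numerical Ecology with R. New York, NY: Springer New York. doi:10.1007/978-1-4419-7976-6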
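The equivalence can be checked on a small simulated data set with base R alone. This is only a sketch: the matrix x and its size are made up for illustration, and it compares cmdscale with prcomp rather than with vegan's rda. Ordination axes are defined only up to sign, so the comparison uses absolute values.

```r
# Sketch: PCoA on Euclidean distances vs. PCA on the same (unscaled) data.
set.seed(1)
x <- matrix(rnorm(200 * 10), ncol = 10)   # hypothetical 200 x 10 data set

# PCoA (classical MDS) on the Euclidean distance matrix
pcoa <- cmdscale(dist(x, method = "euclidean"), k = 2)

# PCA scores on the centered data (no standardization, to match dist(x))
pca <- prcomp(x, center = TRUE, scale. = FALSE)$x[, 1:2]

# Each axis may be flipped, so compare absolute coordinates
max_diff <- max(abs(abs(pcoa) - abs(pca)))
print(max_diff)   # should be at floating-point noise level
```

If the variables are standardized for the PCA (scale. = TRUE), the matching PCoA would be the one computed on dist(scale(x)).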

I can run this without problems on my current local machine. System: Windows 7 x64, i5-2500 3.3ghz, 8GB RAM

library(vegan) # to perform PCA and associated operations 
library(ggplot2) # plotting (not necessary, but nice)
library(grid) # arrow()

#make a big test set like OP's
test<-data.frame(id=seq(34000), var1=rnorm(34000), var2=rnorm(34000),
                 var3=rnorm(34000),var4=rnorm(34000),var5=rnorm(34000),
                 var6=rnorm(34000),var7=rnorm(34000),var8=rnorm(34000),
                 var9=rnorm(34000),var10=rnorm(34000))
#calculate PCA on the 10 variables only (drop the id column)
test.pca<-rda(test[, -1], scale=TRUE)

#calculate percent variation on each axis
test.pca.percExp<-round(eigenvals(test.pca)/sum(eigenvals(test.pca))*100, 2)

#extract scores for plotting
test.pca.sc<-scores(test.pca, choices=c(1,2), 
                           display=c("sites", "species"), scaling=1)

test.pca.site<-data.frame(test.pca.sc$sites)
test.pca.spe<-data.frame(test.pca.sc$species)
test.pca.spe$VAR<-rownames(test.pca.spe)

#make the plot
test.pca.p<-ggplot(test.pca.site, aes(PC1, PC2)) + 
  xlab(sprintf("PC1 %s%s", test.pca.percExp[1], "%")) + 
  ylab(sprintf("PC2 %s%s", test.pca.percExp[2], "%")) 

#add points and biplot arrows to plot
test.pca.p + 
  geom_point() +
  geom_segment(data = test.pca.spe,
               aes(x = 0, xend = PC1, y = 0, yend = PC2),
               arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
  geom_text(data=test.pca.spe,
            aes(x=PC1,y=PC2,label=VAR),
            size=3, position=position_jitter(width=-2, height=0.1))+
  guides(color = guide_legend(title = "Var"))


#hard to see the points with arrows, so plot without the arrows
test.pca.p + 
  geom_point()


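I stumbled upon this question because I ran into the same problem with a Manhattan distance matrix, and my answer does not help with that (although, as far as I know, there may be a way to transform the data before the PCA so that it gives the same result...). This answer should basically give the result I believe the OP is looking for. Hope this helps others too...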
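For readers without vegan, the same idea can be sketched in base R with prcomp. The simulated data below only mimics the dimensions given in the question, and the variable names are made up. Note that prcomp scores are not rescaled the way vegan's scaling=1 site scores are, so the axis units may differ even though the configuration of points is the same.

```r
# Sketch: PCA on a data set of the OP's size, without any n x n distance matrix.
set.seed(42)
n <- 34000
vars <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))  # stand-in for the 10 value columns

pca <- prcomp(vars, center = TRUE, scale. = TRUE)

# Site scores on the first two axes, one row per observation
scores <- pca$x[, 1:2]

# Percent of variance explained by each axis
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 2)

# The full score matrix is only n x 10 doubles, a few megabytes
print(object.size(pca$x), units = "Mb")

plot(scores,
     xlab = sprintf("PC1 %s%%", pct[1]),
     ylab = sprintf("PC2 %s%%", pct[2]),
     pch = ".")
```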

score 0 · Accepted Answer

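You have too little memory to do this in R, which holds all objects in memory. I may not have the calculation exactly right (I forget the sizes of R objects), but just to hold the dissimilarity matrix you would need about 9GB of RAM.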

> print(object.size(matrix(0, ncol = 34000, nrow = 34000)), units = "Gb")
8.6 Gb

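dist will need less than that for its internal representation, because it really only stores 0.5 * (nr * (nr - 1)) doubles (nr is the number of rows in the input data):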

> print(object.size(numeric(length = 0.5 * 34000 * 33999)), units = "Gb")
4.3 Gb

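[This is probably the source of the error you are seeing]

In practice, once the dissimilarity matrix has been computed, you would need more than 20-30GB of RAM to do anything useful with it. Even if you could compute them, the eigenvectors of the PCoA solution would need ~9Gb of RAM just on their own.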
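The figures above follow from 8 bytes per double. As a rough sketch (the helper names dist_gib and full_gib are made up here, and object headers are ignored), the same arithmetic for the exact row count in the question is:

```r
# Approximate memory in GiB, counting 8 bytes per double and nothing else.

# Lower triangle stored by a dist object: n * (n - 1) / 2 values
dist_gib <- function(n) n * (n - 1) / 2 * 8 / 2^30

# Full n x n matrix (e.g. a square dissimilarity or eigenvector matrix)
full_gib <- function(n) n^2 * 8 / 2^30

dist_gib(33741)   # about 4.24, in line with "cannot allocate vector of size 4.2 Gb"
full_gib(33741)   # about 8.5
full_gib(34000)   # about 8.6, the matrix() figure quoted above
```

Because the cost grows with the square of the number of rows, halving the rows cuts the requirement to roughly a quarter.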

So a more pertinent question is: what do you hope to do with c. 34000 samples/observations?

To get a matrix from mydata[3:4] you can use

as.matrix(mydata[3:4])

or, if you have factors and want to keep their numeric interpretation

data.matrix(mydata[3:4])
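As for why matrix() produced the odd 2-row output in the question: a data frame is a list of columns, so matrix() treats each column as a single list element. A small sketch with a made-up data frame df (not the OP's data) shows the difference:

```r
# Hypothetical miniature of the OP's data: an id column plus two value columns
df <- data.frame(id = 1:5, a = 1:5, b = 6:10)

# matrix() on a data frame yields a list-matrix with one cell per column
m_bad <- matrix(df[c(2, 3)])
print(dim(m_bad))      # 2 rows, 1 column
print(typeof(m_bad))   # "list"

# as.matrix() keeps the rows and columns as numbers
m_ok <- as.matrix(df[c(2, 3)])
print(dim(m_ok))       # 5 rows, 2 columns
```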
