r - 对 2 个距离矩阵求和以获得第三个“整体”距离矩阵（生态环境）

Question

我是生态学家，主要使用纯素 R 包。

我有 2 个矩阵（样本 x 丰度）（见下面的数据）：

矩阵 1/ nrow= 6replicates*24sites, ncol=15 物种丰度（鱼）矩阵 2/ nrow= 3replicates*24sites, ncol=10 物种丰度（无脊椎动物）

两个矩阵中的位点相同。我想获得成对站点之间的整体布雷柯蒂斯差异（考虑两个矩阵）。我看到 2 个选项：

选项 1，对重复（在站点规模）鱼类和大型无脊椎动物丰度进行平均，cbind 两个平均丰度矩阵（nrow=24sites，ncol=15+10 平均丰度）并计算 bray-curtis。

选项 2，对于每个组合，计算站点对之间的布雷-柯蒂斯相异度，计算站点质心之间的距离。然后对2个距离矩阵求和。

如果我不清楚，我在下面的 R 代码中做了这两个操作。

请您告诉我选项 2 是否正确且比选项 1 更合适。

先感谢您。

皮埃尔

下面是 R 代码示例

生成数据

library(plyr);library(vegan)

#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
  Habitat=paste("H",gl(2,12*6),sep=""),
  Site=paste("S",gl(24,6),sep=""),
  Replicate=rep(paste("R",1:6,sep=""),24))

summary(a1.env)

a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))

names(a1.bio)=paste("F",1:15,sep="")

a1.bio[1:72,]=2*a1.bio[1:72,]

#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site

a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]

summary(a2.env)

a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))

names(a2.bio)=paste("I",1:10,sep="")

a2.bio[1:36,]=0.5*a2.bio[1:36,]


#environmental data at the sit scale

env=unique(a1.env[,c("Habitat","Site")])

env=env[order(env$Site),]

选项 1，平均丰度和 cbind

a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))

a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]

a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))

a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]

bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])

dist.mean=vegdist(sqrt(bio.mean),"bray")

选项 2，计算质心之间的每个组合距离并对 2 个距离矩阵求和

a1.dist=vegdist(sqrt(a1.bio),"bray")

a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids

a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")

a2.dist=vegdist(sqrt(a2.bio),"bray")

a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids

a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")

使用 Gavin Simpson 的 fuse() 对两个距离矩阵求和

dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))

总结两个欧几里得距离矩阵（感谢 Jari Oksanen 校正）

dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)

以及下面的“coord.centroid”，用于进一步基于距离的分析（是否正确？）

coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)

比较选项 1 和 2

pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))

pco.centroid=cmdscale(dist.centroid)

comparison=procrustes(pco.centroid,pco.mean)

protest(pco.centroid,pco.mean)

score 4 · Accepted Answer

平均距离的正确性取决于你对这些距离做了什么。在某些应用程序中，您可能会认为它们确实是距离。也就是说，它们满足某些度量属性并且与原始数据具有定义的关系。组合的差异可能无法满足这些要求。

这个问题与对差异性的部分 Mantel 类型分析与对 beta 多样性研究中真正热门（我的意思是红热）的矩形数据分析的争议有关。我们素食主义者为两者提供工具，但我认为在大多数情况下，矩形数据的分析更稳健、更强大。对于矩形数据，我的意思是正常采样单位乘以物种矩阵。素食主义者中首选的基于差异的方法将差异映射到矩形。素食主义者的这些方法包括 db-RDA ( capscale)、置换 MANOVA ( adonis) 和组内分散分析 ( betadisper)。处理这种差异性的方法包括mantel, anosim, mrpp, meandis.

差异或距离的平均值通常与原始矩形数据没有明确的对应关系。即：差异的平均值与数据的平均值不对应。我认为通常最好对数据进行平均或处理，然后从转换后的数据中获得差异。

如果你想结合不同之处，analogue::fuse() 风格方法是最实用的。但是，您应该了解，这fuse() 也将相异矩阵缩放为相等的最大值。如果您有 0..1 尺度的相异性度量，这通常是次要问题，除非其中一个数据集更同质且最大相异性低于其他数据集。它们都被均衡化了，fuse()因此它不是简单的平均，而是范围均衡后的平均。此外，您必须记住，平均差异通常会破坏几何形状，如果您使用矩形化数据的分析方法（素食主义者中的 , , ）adonis，betadisper这将很重要。capscale

最后是关于组合相异的几何学。0..1 级的差异指数是 A/B 类型的分数。只有当分母相等时，才能直接将两个分数相加（然后除以求平均值）。如果您忽略这一点并直接平均分数，则结果将不等于平均数据的相同分数。这就是我所说的破坏几何的意思。一些开放式指数不是分数，可能是相加的。曼哈顿距离是相加的。欧几里得距离是平方差的平方根，它们的平方是加法的，但不是直接的距离。

我通过展示将两个不同点相加的效果来展示这些东西（平均意味着将结果除以二，或除以合适的权重）。我采用素食主义者的 Barro Colorado Island 数据并将其分成两个大小略有不同的子集。保留数据子集距离的几何图形将给出与完整数据分析相同的结果：

library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1]  50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straigthforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below where we make dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## Fuse rescales dissimilarities and they are no more additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)

我只考虑了上面的加法，但如果你不能加法，你就不能平均。如果这很重要，这是一个品味问题。勇敢的人会对无法平均的事情进行平均，但有些人比较胆小，想要遵守规则。我宁愿去第二组。

score 4 · Accepted Answer

一个更简单的解决方案是通过加权每个矩阵来灵活地组合两个相异矩阵。权重需要总和为 1。对于两个相异矩阵，融合相异矩阵为

d.fused = (w * d.x) + ((1 - w) * d.y)

其中w是数字标量（长度为 1 的向量）权重。如果您没有理由比另一组更加权一组不同，只需使用w = 0.5.

我的模拟包中有一个功能可以为您执行此操作；fuse(). 来自的例子?fuse是

 train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
 train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
                      ncol = 10))
 rownames(train1) <- rownames(train2) <- LETTERS[1:10]
 colnames(train1) <- colnames(train2) <- as.character(1:10)

 d1 <- vegdist(train1, method = "bray")
 d2 <- vegdist(train2, method = "jaccard")

 dd <- fuse(d1, d2, weights = c(0.6, 0.4))
 dd
 str(dd)

这个想法被用于有监督的 Kohonen 网络（有监督的 SOM），以将多层数据带入单个分析中。

模拟与素食主义者密切合作，因此并排运行这两个包不会有任何问题。

score 0 · Accepted Answer

我喜欢这个答案的简单性，但它仅适用于添加 2 个距离矩阵：

d.fused = (w * d.x) + ((1 - w) * d.y)

所以我编写了自己的代码片段来组合多个距离矩阵（不仅仅是 2 个）的数组，并使用标准 R 包：

# generate array of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x),dist(y),dist(z))

# create new distance matrix with first element of array
dst <- dst_array[[1]]

# loop over remaining array elements, add them to distance matrix
for (jj in 2:length(dst_array)){
  dst <- dst + dst_array[[jj]]
}

您还可以使用大小相似的向量dst_array来定义比例因子

dst <- dst + my_scale[[jj]] * dst_array[[jj]]