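After thinking about it some more, the solution became obvious.
Instead of using all the observations in long format (200k), I combined the sampled longitude and depth into a single variable and treated it like sampling units along a transect. The result is 3800 columns of longitude-depth combinations and 61 rows of taxa, with taxon abundance as the value variable (to cluster the sampling units instead, you have to transpose the df). This is feasible for hclust or SIMPROF, because the quadratic complexity now applies to only 61 rows (rather than the ~200k I tried at first).
Cheers
Here is some code: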
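To make the size argument concrete, here is a rough sketch of the arithmetic. It assumes the dissimilarities are stored as a standard R `dist` object (one double per pair of rows); the helper name `n_pairs` is made up for illustration.

```r
# Number of pairwise dissimilarities a dist object stores for n rows: n * (n - 1) / 2
n_pairs <- function(n) n * (n - 1) / 2

pairs_taxa <- n_pairs(61)      # 1830 pairs for the 61 taxa
pairs_long <- n_pairs(200000)  # 19999900000 pairs for ~200k long-format rows

# Approximate storage at 8 bytes per double, in gigabytes (roughly 160 GB)
gb_long <- pairs_long * 8 / 1e9
```

The 61-row case is trivial, while the long-format case would not fit in memory on a typical machine.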
library(reshape2)
library(dplyr)
d4<-d4 %>% na.omit() %>% arrange(desc(LONGITUDE_DEC))
# combine longitude and depth into one variable shared by all measured taxa,
# analogous to sampling units in community ecology
d4$sampling_units<-paste(d4$LONGITUDE_DEC,d4$BIN_MIDDEPTH_M)
d5<-d4 %>% select(PREDICTED_GROUP,CONCENTRATION_IND_M3,sampling_units)
d5<-d5%>%na.omit()
# dcast the data frame so that taxa are rows, sampling units are columns, and
# concentration/abundance are the values.
d6<-dcast(d5,PREDICTED_GROUP ~ sampling_units, value.var = "CONCENTRATION_IND_M3")
d7<-d6 %>% na.omit()
d7$PREDICTED_GROUP<-as.factor(d7$PREDICTED_GROUP)
# give the rownames the taxa names
rownames(d7)<-paste(d7$PREDICTED_GROUP)
#delete that variable that is no longer needed
d7$PREDICTED_GROUP<-NULL
library(vegan)
# calculate the dissimilarity matrix with vegdist so you can use the
# Sørensen/Bray-Curtis method
distBray <- vegdist(d7, method = "bray")
# calculate the clusters with ward.D2
clust1 <- hclust(distBray, method = "ward.D2")
clust1
#plot the cluster dendrogram with dendextend
library(dendextend)
library(ggdendro)
library(ggplot2)
dend <- clust1 %>%
  as.dendrogram() %>%
  set("branches_k_color", k = 5) %>%
  set("branches_lwd", 0.5) %>%
  set("clear_leaves") %>%
  set("labels_colors", k = 5) %>%
  set("leaves_cex", 0.5) %>%
  set("labels_cex", 0.5)
ggd1 <- as.ggdend(dend)
ggplot(ggd1, horiz = TRUE)
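
The transposition mentioned above (clustering sampling units rather than taxa) could look roughly like the sketch below. The toy table, its taxon names, and its values are invented for illustration; in the real workflow the wide taxa-by-sampling-unit data frame would take the place of `toy`.

```r
library(vegan)

# Toy abundance table (hypothetical values): taxa as rows, sampling units as columns
toy <- data.frame(
  `10.5 25` = c(12, 0, 3),
  `10.5 75` = c(8, 1, 0),
  `11.0 25` = c(0, 5, 7),
  `11.0 75` = c(2, 4, 9),
  row.names = c("Copepoda", "Chaetognatha", "Appendicularia"),
  check.names = FALSE
)

# t() returns a numeric matrix with sampling units as rows, which vegdist() accepts
units_by_taxa <- t(toy)

# Same dissimilarity and linkage as for the taxa, now applied to sampling units
dist_units <- vegdist(units_by_taxa, method = "bray")
clust_units <- hclust(dist_units, method = "ward.D2")
```

With 3800 sampling units as rows this means about 7.2 million pairwise distances, which is still far smaller than the ~200k-row case. The resulting `clust_units` can be plotted with the same dendextend steps.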