
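I have a dataset with 2 different clustering solutions (1 run externally, 1 done by me). I want to compare them using the `tanglegram` and `entanglement` functions from the dendextend package, but I keep getting errors about the labels and I don't know why. To illustrate, I made a simple example with mtcars: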

df1 <- mtcars
df1$ID <- row.names(mtcars)
clusts <- 1:3

# simulate two different cluster algorithms as columns containing cluster group
df1$cl1 <- sample(clusts, nrow(df1), replace = TRUE)
df1$cl2 <- sample(clusts, nrow(df1), replace = TRUE)
table(df1$cl1, df1$cl2)

# Make a copy
df2 <- df1

# Use data.tree to convert df's to data.trees
library(data.tree)
df1$pathString <- paste("Tree1", df1$cl1, df1$ID, sep = "/")
df2$pathString <- paste("Tree2", df2$cl2, df2$ID, sep = "/")

node1 <- as.Node(df1)
node2 <- as.Node(df2)

# Convert to dendrograms and compare using dendextend
library(dendextend)
dend1 <- as.dendrogram(node1)
dend2 <- as.dendrogram(node2)

tanglegram(dend1, dend2)
entanglement(dend1, dend2)

This gives the following errors:

> tanglegram(dend1, dend2)
Error in dend12[[1]] : subscript out of bounds
In addition: Warning message:
In intersect_trees(dend1, dend2, warn = TRUE) :
  The two trees had no common labels!
> entanglement(dend1, dend2)
Error in match_order_by_labels(dend2, dend1) : 
  labels do not match in both trees.  Please make sure to fix the labels    names!
(make sure also that the labels of BOTH trees are 'character')

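I don't understand why these errors occur, and inspecting the data structures hasn't given me an answer. Any pointers would be much appreciated!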
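Both messages point at the leaf labels, so one basic check is to compare what `labels()` returns for each tree. The following is a minimal sketch in base R only. `compare_labels` is a hypothetical helper name, and the two `hclust` trees are stand-ins so that the block runs without data.tree or dendextend.

```r
# Hypothetical helper (not part of dendextend): compare the leaf labels of
# two dendrograms and report what they share.
compare_labels <- function(d1, d2) {
  l1 <- labels(d1)
  l2 <- labels(d2)
  list(
    both_character = is.character(l1) && is.character(l2),
    common         = intersect(l1, l2),
    only_in_first  = setdiff(l1, l2),
    only_in_second = setdiff(l2, l1)
  )
}

# Stand-in trees: two hierarchical clusterings of the same data,
# so both trees carry the same 32 car names as labels.
d_a <- as.dendrogram(hclust(dist(mtcars), method = "complete"))
d_b <- as.dendrogram(hclust(dist(mtcars), method = "average"))

res <- compare_labels(d_a, d_b)
res$both_character
length(res$common)
```

Running `compare_labels(dend1, dend2)` on the trees from the example would show whether `labels()` finds any labels in them at all, and whether they are character vectors as the second error message asks.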

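EDIT, noting @emilliman5's answer below: I know my dendrograms are unresolved. I am not using hierarchical clustering, so unresolved dendrograms are exactly what I want to compare. Further, I adapted some code from this question: How to create a dendrogram (or "hclust") object manually? (in R) to build the dendrograms myself. Although these are unresolved, they do produce a tanglegram. However, this is not a solution, because it is too hard to generalise to different parameters (my tree depth/resolution will change, and trying to write a function that encodes trees with varying levels of nesting is a road to madness!).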

tree1 <- list()
attributes(tree1) <- list(members=nrow(df1), height=3)
class(tree1) <- "dendrogram"

# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for(i in 1:length(clusts)){
    leaves[[i]] <- which(df1$cl1 == (i) )
}
for(i in 1:length(clusts)){
    tree1[[i]] <- list()
    attributes(tree1[[i]]) <- list(members=length(which(df1$cl1==i)), height=2, edgetext=i)
    for( j in 1:length(leaves[[i]]) ){
        tree1[[i]][[j]] <- list()
        tree1[[i]][[j]] <- leaves[[i]]
        attributes(tree1[[i]][[j]]) <- list(members = 1, height = 1,
                                       label = as.character(leaves[[i]][j]),
                                       leaf = TRUE)
    }
}
plot(tree1, center=TRUE)

tree2 <- list()
attributes(tree2) <- list(members=nrow(df2), height=3)
class(tree2) <- "dendrogram"

# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for(i in 1:length(clusts)){
    leaves[[i]] <- which(df2$cl2 == (i) )
}
for(i in 1:length(clusts)){
    tree2[[i]] <- list()
    attributes(tree2[[i]]) <- list(members=length(which(df2$cl2==i)), height=2, edgetext=i)
    for( j in 1:length(leaves[[i]]) ){
        tree2[[i]][[j]] <- list()
        tree2[[i]][[j]] <- leaves[[i]]
        attributes(tree2[[i]][[j]]) <- list(members = 1, height = 1,
                                        label = as.character(leaves[[i]][j]),
                                        leaf = TRUE)
    }
}
plot(tree2, center=TRUE)

tanglegram(tree1, tree2)

[Image: ugly tanglegram]

It's ugly, but it's what I want/need.
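The two loops above differ only in which cluster column they read, so they could be folded into one helper. This is a sketch under assumptions, not a general solution: `build_cluster_dend` is a hypothetical name, it only covers the fixed root/cluster/leaf depth used here, and unlike the loops above each leaf stores just its own row index rather than the whole index vector. I have not checked its output against dendextend.

```r
# Hypothetical helper: build a three-level dendrogram (root -> cluster -> leaf)
# from a vector of cluster assignments, one entry per observation.
build_cluster_dend <- function(cluster) {
  # Row indices grouped by cluster id; only clusters that occur are kept.
  groups <- split(seq_along(cluster), cluster)

  branches <- lapply(names(groups), function(g) {
    leaves <- lapply(groups[[g]], function(idx) {
      # Each leaf is a single integer index with the usual leaf attributes.
      structure(idx, members = 1L, height = 1,
                label = as.character(idx), leaf = TRUE)
    })
    structure(leaves, members = length(leaves), height = 2, edgetext = g)
  })

  structure(branches, members = length(cluster), height = 3,
            class = "dendrogram")
}

# Deterministic stand-in assignment so the block is reproducible.
cl <- rep(1:3, length.out = nrow(mtcars))
tree <- build_cluster_dend(cl)
attr(tree, "members")
```

With the data from the example, the calls would be `build_cluster_dend(df1$cl1)` and `build_cluster_dend(df2$cl2)`.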

If I look inside the dendrograms to try to work out why this happens:

> str(unclass(tree1[[1]][[1]]))
 atomic [1:12] 1 8 9 10 11 13 16 22 25 27 ...
 - attr(*, "members")= num 1
 - attr(*, "height")= num 1
 - attr(*, "label")= chr "1"
 - attr(*, "leaf")= logi TRUE

Notice that the leaf holds a vector. Looking at a dendrogram derived from hclust, we see that its leaf also holds a vector/atomic:

> str(unclass(as.dendrogram(hclust(dist(df1))))[[1]][[1]])
 atomic [1:1] 31
 - attr(*, "members")= int 1
 - attr(*, "height")= num 0
 - attr(*, "label")= chr "Maserati Bora"
 - attr(*, "leaf")= logi TRUE

However, looking at the dendrogram created by data.tree, I notice that there is no vector/atomic:

> str(unclass(dend1[[1]][[1]]))
 list()
 - attr(*, "label")= chr "Mazda RX4"
 - attr(*, "members")= num 1
 - attr(*, "height")= num 0
 - attr(*, "leaf")= logi TRUE

Could this missing atomic be causing the problem?
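To check every leaf rather than just the first one, the same inspection can be done recursively. This is a minimal sketch in base R. `leaf_summary` is a hypothetical helper name, and it is demonstrated on an hclust-derived tree so that it runs without data.tree.

```r
# Hypothetical helper: walk a dendrogram and record, for every leaf, its
# label and what kind of value the leaf itself holds.
leaf_summary <- function(node) {
  if (is.leaf(node)) {
    value <- unclass(node)
    return(data.frame(
      label     = as.character(attr(node, "label")),
      is_atomic = is.atomic(value),
      length    = length(value),
      stringsAsFactors = FALSE
    ))
  }
  # Internal node: recurse into each child and stack the results.
  do.call(rbind, lapply(seq_along(node), function(i) leaf_summary(node[[i]])))
}

hc_dend <- as.dendrogram(hclust(dist(mtcars)))
hc_leaves <- leaf_summary(hc_dend)
head(hc_leaves)
```

Going by the `str()` output above, `leaf_summary(dend1)` would be expected to report `is_atomic = FALSE` and `length = 0` for the data.tree leaves, which makes the difference from the hclust tree easy to see at a glance.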

1 Answer

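The problem is that your trees are not bifurcating, i.e. at some nodes you can follow more than two branches. In hierarchical clustering, every internal node should have exactly two branches. See the two examples below: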

This is the tree from your example:

[image]

This is what a resolved tree looks like:

plot(hclust(dist(df1[, 1:11])))

[image]
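Whether a given dendrogram is bifurcating in this sense can also be checked in code instead of by eye. The following is a rough sketch in base R. `is_binary_dend` and `make_leaf` are hypothetical helper names, and the small hand-built tree is only an illustration of an unresolved node.

```r
# Hypothetical helper: TRUE if every internal node has exactly two children.
is_binary_dend <- function(node) {
  if (is.leaf(node)) return(TRUE)
  if (length(node) != 2L) return(FALSE)
  all(vapply(seq_along(node),
             function(i) is_binary_dend(node[[i]]),
             logical(1)))
}

# hclust merges two clusters at a time, so its dendrogram is resolved.
resolved <- as.dendrogram(hclust(dist(mtcars)))
is_binary_dend(resolved)    # TRUE

# A hand-built, unresolved tree: a single root with three leaves.
make_leaf <- function(index, label) {
  structure(index, label = label, members = 1L, height = 0, leaf = TRUE)
}
unresolved <- structure(
  list(make_leaf(1L, "a"), make_leaf(2L, "b"), make_leaf(3L, "c")),
  members = 3L, height = 1, class = "dendrogram"
)
is_binary_dend(unresolved)  # FALSE
```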

answered 2017-02-22T16:54:07.437