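I am trying to perform a fairly simple cluster analysis but cannot get the right results. My question for a large dataset is "Which diseases are frequently reported together?". The simplified data sample below should result in 2 clusters: 1) headache / dizziness and 2) nausea / abd pain. However, I cannot get the code right. I am using the pam() and daisy() functions from the cluster package. For this example I manually assign 2 clusters (k = 2) because I know the desired result, but in practice I explore several values of k.

Does anyone know what I am doing wrong here?
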
    library(cluster)
    library(dplyr)

    dat <- data.frame(
      ID     = c("id1", "id1", "id2", "id2", "id3", "id3", "id4", "id4", "id5", "id5"),
      PTName = c("headache", "dizziness", "nausea", "abd pain", "dizziness",
                 "headache", "abd pain", "nausea", "headache", "dizziness")
    )

    gower_dist <- daisy(dat, metric = "gower")

    k <- 2
    pam_fit <- pam(gower_dist, diss = TRUE, k)  # performs cluster analysis

    pam_results <- dat %>%
      mutate(cluster = pam_fit$clustering) %>%
      group_by(cluster) %>%
      do(the_summary = summary(.))

    head(pam_results$the_summary)
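
One possible direction is sketched below. It assumes the goal is to cluster the terms themselves rather than the rows of dat, since daisy(dat) measures distances between individual (ID, PTName) rows. The sketch builds a term-by-report incidence matrix and computes a Jaccard distance between terms. The names incidence, term_dist and pam_terms are illustrative only, and replacing the Gower metric with the "binary" (Jaccard) method of dist() is an assumption, not something taken from the code above.

```r
library(cluster)

dat <- data.frame(
  ID     = c("id1", "id1", "id2", "id2", "id3", "id3", "id4", "id4", "id5", "id5"),
  PTName = c("headache", "dizziness", "nausea", "abd pain", "dizziness",
             "headache", "abd pain", "nausea", "headache", "dizziness")
)

# Rows are terms, columns are reports; 1 means the term appears in that report.
incidence <- 1 * (unclass(table(dat$PTName, dat$ID)) > 0)

# Jaccard distance between terms: 0 when two terms occur in exactly the same
# reports, 1 when they never occur together.
term_dist <- dist(incidence, method = "binary")

# Cluster the terms (not the rows of dat) around k medoids.
k <- 2
pam_terms <- pam(term_dist, k = k, diss = TRUE)

# List the terms that fall into each cluster.
term_groups <- split(rownames(incidence), pam_terms$clustering)
print(term_groups)
```

On this toy sample, terms that are always reported together have a distance of 0, so they should land in the same cluster. Whether Jaccard distance is a sensible choice for the full dataset would still need to be checked.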