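I am trying to run a very simple cluster analysis but cannot get the right results. My question for a large dataset is "Which illnesses are frequently reported together?". The simplified data sample below should result in 2 clusters: 1) headache/dizziness 2) nausea/abd pain. However, I can't get the code right. I am using the `pam` and `daisy` functions. For this example I manually assigned 2 clusters (k = 2) because I know the desired result, but in reality I explore several values of k.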

Does anyone know what I'm doing wrong here?

library(cluster)
library(dplyr)

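# stringsAsFactors = TRUE: daisy() expects factor columns, and R >= 4.0 no longer converts strings by default
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
                  stringsAsFactors = TRUE)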


gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k)  # performs cluster analysis
pam_results <- dat %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)

1 Answer


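The format in which you feed the dataset to the clustering algorithm does not suit your objective. If you want to group the illnesses that are reported together but also include ID in the dissimilarity matrix, ID takes part in the construction of the matrix. You do not want that, since your objective concerns only the illnesses.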
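
To see the effect concretely, here is a minimal sketch. The four-row data frame `dat_long` is a cut-down copy of the question's data, and `stringsAsFactors = TRUE` is added on the assumption that `daisy` needs factor columns. With two nominal columns, Gower averages a 0/1 mismatch per column, so two rows from the same patient come out closer than two rows from different patients, whatever the illnesses are.

```r
library(cluster)

# Long format, as in the question: one row per (patient, illness) pair
dat_long <- data.frame(
  ID     = c("id1", "id1", "id2", "id2"),
  PTName = c("headache", "dizziness", "nausea", "abd pain"),
  stringsAsFactors = TRUE
)

d <- as.matrix(daisy(dat_long, metric = "gower"))

same_patient  <- d[1, 2]  # same ID, different illness
other_patient <- d[1, 3]  # different ID, different illness
print(c(same_patient = same_patient, other_patient = other_patient))
```

Rows 1 and 2 differ in one of two columns (distance 0.5), while rows 1 and 3 differ in both (distance 1), so the long format pulls rows together by patient rather than by illness.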

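Hence, we need to build a dataset in which each row is a patient with all the illnesses he/she reported, and then construct the dissimilarity matrix on the numeric features only. For this task, I add a `presence` column with value 1 if the patient reported the illness and 0 otherwise; the `pivot_wider` function fills in the zeros automatically.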

Here is the code I used. I think it gets what you wanted; let me know if it does.

library(cluster)
library(dplyr)
library(tidyr)

dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
                  presence = 1)
# build the wider dataset: each row is a patient
dat_wider <- pivot_wider(
    dat,
    id_cols = ID,
    names_from = PTName,
    values_from = presence,
    values_fill = list(presence = 0)
)

# in the dissimilarity matrix construction, we leave out the column ID
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2

set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k) 
pam_results <- dat_wider %>%
    mutate(cluster = pam_fit$clustering) %>%
    group_by(cluster) %>%
    do(the_summary = summary(.))
head(pam_results$the_summary)
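
To read off which illnesses characterise each cluster, a per-cluster mean of the 0/1 columns can be easier to scan than `summary()`. The following is only a sketch under a few assumptions: it rebuilds the patient-by-illness matrix with base R `table()` as a stand-in for the `pivot_wider` step so that it depends on `cluster` alone, and the names `incidence` and `cluster_profile` are illustrative.

```r
library(cluster)

dat <- data.frame(
  ID     = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
             "headache","abd pain","nausea","headache","dizziness")
)

# Base-R stand-in for pivot_wider: one row per patient, one 0/1 column per illness
tab <- table(dat$ID, dat$PTName)
incidence <- matrix(as.numeric(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))

# Same clustering step as above, on the illness columns only
pam_fit <- pam(daisy(incidence, metric = "gower"), k = 2, diss = TRUE)

# Share of patients in each cluster who reported each illness
cluster_profile <- aggregate(as.data.frame(incidence),
                             by = list(cluster = pam_fit$clustering),
                             FUN = mean)
print(cluster_profile)
```

On this sample, patients id1, id3 and id5 end up in one cluster and id2 and id4 in the other, so every illness column is 1 in one row of `cluster_profile` and 0 in the other.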

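Moreover, since you are working with binary data only, you could consider using the simple matching or Jaccard distance instead of the Gower distance, if they suit your data better. In R you can compute them with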

p <- ncol(dat_wider) - 1  # number of binary variables (every column except ID)
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan") / p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")

respectively, where `p` is the number of binary variables you want to consider.
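
The practical difference between the two is how they treat illnesses that neither patient reported. A small worked example, using two hypothetical patients `a` and `b` that are not part of the original data:

```r
# Two hypothetical patients over four binary illness indicators
x <- rbind(a = c(1, 1, 0, 0),
           b = c(1, 0, 0, 0))
p <- ncol(x)  # number of binary variables

# Simple matching: mismatches divided by all variables (joint absences count as agreement)
sm <- as.numeric(dist(x, method = "manhattan")) / p

# Jaccard: mismatches divided by variables reported by at least one of the two patients
jac <- as.numeric(dist(x, method = "binary"))

print(c(simple_matching = sm, jaccard = jac))
```

Here the two patients disagree on one illness. Simple matching gives 1/4 = 0.25 because the two jointly absent illnesses count as agreement, whereas Jaccard ignores them and gives 1/2 = 0.5. With many rarely reported illnesses, Jaccard may therefore be the more informative choice.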

Answered 2020-02-28T08:54:44.013