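I'm building a clustering algorithm for data I haven't seen yet, so in the meantime I'm working with some fake data. The PAM results report that none of my clusters are isolated, yet a ggplot of a t-SNE embedding shows well-structured clusters. I suspect this is an artifact of my fake data. Does anyone have an idea why this happens?
Here is the data. Note that Age and howOld represent different things: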
library(dplyr)
library(cluster)
library(Rtsne)
library(ggplot2)

set.seed(1987)
n <- 350

# daisy() treats factor columns as nominal; since R 4.0.0 data.frame()
# keeps strings as character unless stringsAsFactors = TRUE is set
clust_dat <-
  data.frame(personId = 1:n,
             networkPref = sample(c("topic", "jobtitle", "orgtype"),
                                  size = n, replace = TRUE,
                                  prob = c(0.56, 0.20, 0.24)),
             Age = sample(23:65, size = n, replace = TRUE),
             familyImp = sample(c(1, 2, 3, 4, 5), size = n, replace = TRUE,
                                prob = c(0.02, 0.01, 0.10, 0.4, 0.83)),
             howOld = sample(25:30, size = n, replace = TRUE,
                             prob = c(.40, .30, .20, .05, .03, .02)),
             horror = sample(c("Yes", "No"), size = n, replace = TRUE,
                             prob = c(0.27, 0.73)),
             sailBoat = sample(c("Yes", "No"), size = n, replace = TRUE,
                               prob = c(0.58, 0.42)),
             stringsAsFactors = TRUE)
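One detail about the fake data: the familyImp weights sum to 1.36 rather than 1. sample() accepts weights that do not sum to one and rescales them, so the call runs, but the effective probabilities differ from the numbers as written. A minimal sketch of the rescaling (the names w and p are just for illustration):

```r
# Weights passed to sample() for familyImp; they sum to 1.36, not 1
w <- c(0.02, 0.01, 0.10, 0.4, 0.83)

# Rescale the weights so they sum to one, as sample() does internally
p <- w / sum(w)
names(p) <- 1:5

# Effective probability of each familyImp level
print(round(p, 3))
```

Level 5 ends up with a probability of roughly 0.61 rather than 0.83.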
Here is the model I build, after first defining the levels of my ordinal variable:
clust_dat$familyImp <- factor(clust_dat$familyImp,
                              levels = c("1", "2", "3", "4", "5"),
                              ordered = TRUE)

# Gower distance on everything except personId
gower_dist <- daisy(clust_dat[, -1], metric = "gower")
gower_matrix <- as.matrix(gower_dist)

# find silhouette width for many PAM models
sil_width <- c(NA)
for (i in 2:ceiling(nrow(clust_dat) / 9)) {
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

# build PAM model with best silhouette width
pam_fit <- pam(gower_dist, diss = TRUE, k = which.max(sil_width))
When I pull the isolation information from the PAM fit, I get:
pam_fit$isolation
1 2 3 4 5 6 7 8 9 10 11 12
no no no no no no no no no no no no
Levels: no L L*
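For context, the cluster package documentation describes an L*-cluster as one whose diameter is smaller than its separation, and both quantities are reported per cluster in the clusinfo component of a pam fit. The sketch below is only an illustration on hypothetical toy data (two tight, far-apart groups, not my data) showing how the two columns can be compared against the isolation flags:

```r
library(cluster)

set.seed(1)
# Hypothetical toy data: two tight groups centred at 0 and 10 in 2-D
toy <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.1), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.1), ncol = 2))
toy_fit <- pam(dist(toy), k = 2, diss = TRUE)

# clusinfo holds size, max_diss, av_diss, diameter and separation per cluster
info <- as.data.frame(toy_fit$clusinfo)
info$diam_lt_sep <- info$diameter < info$separation
info$isolation <- as.character(toy_fit$isolation)
print(info)
```

Running the same comparison on pam_fit$clusinfo would show how far each of my clusters is from meeting that condition.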
But the plot shows some well-structured clusters:
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <-
  tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = clust_dat$personId)

ggplot(tsne_data, aes(x = X, y = Y)) +
  geom_point(aes(color = cluster))
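Since t-SNE mainly preserves local neighbourhoods, one cross-check I could try is a second embedding of the same distances. The sketch below swaps in classical MDS (cmdscale from base R) in place of t-SNE; it runs on a small hypothetical toy data frame, and the size m, the chosen columns and k = 4 are arbitrary assumptions for illustration only.

```r
library(cluster)
library(ggplot2)

set.seed(1987)
m <- 120  # hypothetical toy size, not the n used above
toy <- data.frame(
  Age      = sample(23:65, m, replace = TRUE),
  horror   = factor(sample(c("Yes", "No"), m, replace = TRUE)),
  sailBoat = factor(sample(c("Yes", "No"), m, replace = TRUE))
)

toy_dist <- daisy(toy, metric = "gower")
toy_fit  <- pam(toy_dist, diss = TRUE, k = 4)  # k = 4 is arbitrary here

# Classical MDS on the Gower distances, used here instead of t-SNE
mds_xy <- cmdscale(toy_dist, k = 2)
mds_data <- data.frame(X = mds_xy[, 1],
                       Y = mds_xy[, 2],
                       cluster = factor(toy_fit$clustering))

p <- ggplot(mds_data, aes(x = X, y = Y)) +
  geom_point(aes(color = cluster))
print(p)
```

If the groups look much less separated under MDS than under t-SNE, the tidy picture may owe more to the embedding than to the distances themselves.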
Any ideas? If I remove all the continuous variables, I get very poorly defined clusters, yet some of them are flagged as isolated...