r - Inserting a new matrix into a scatterplot in R

Question

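I would like to insert new coordinates, taken from another matrix, into my scatterplot. I am using the fviz_cluster function to generate the plot of the clusters. I want to add the coordinates stored in the matrix of centers of mass (center_mass) to the plot, since they are the best location within each cluster for installing a manure composting machine. So far I can only generate the scatterplot for the properties themselves, as attached. The code is below: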

> library(readxl)
> df <- read_excel('C:/Users/testbase.xlsx') #matrix containing waste production, latitude and longitude
> dim (df)
[1] 19  3
> d<-dist(df)
> fit.average<-hclust(d,method="average") 
> clusters<-cutree(fit.average, k=6) 
> df$cluster <- clusters # inserting column with determination of clusters
> df
    Latitude    Longitude  Waste   cluster
     <dbl>       <dbl>     <dbl>     <int>
 1    -23.8     -49.6      526.        1
 2    -23.8     -49.6      350.        2
 3    -23.9     -49.6      526.        1
 4    -23.9     -49.6      469.        3
 5    -23.9     -49.6      285.        4
 6    -23.9     -49.6      175.        5
 7    -23.9     -49.6      175.        5
 8    -23.9     -49.6      350.        2
 9    -23.9     -49.6      350.        2
10    -23.9     -49.6      175.        5
11    -23.9     -49.7      350.        2
12    -23.9     -49.7      175.        5
13    -23.9     -49.7      175.        5
14    -23.9     -49.7      364.        2
15    -23.9     -49.7      175.        5
16    -23.9     -49.6      175.        5
17    -23.9     -49.6      350.        2
18    -23.9     -49.6      45.5        6
19    -23.9     -49.6      54.6        6

> ########Generate scatterplot
> library(factoextra)
> fviz_cluster(list(data = df, cluster = clusters))
> 
> 
>  ##Center of mass, best location of each cluster for installation of manure composting machine
> center_mass<-matrix(nrow=6,ncol=2)
> for(i in 1:6){
+ center_mass[i,]<-c(weighted.mean(subset(df,cluster==i)$Latitude,subset(df,cluster==i)$Waste),
+ weighted.mean(subset(df,cluster==i)$Longitude,subset(df,cluster==i)$Waste))}
> center_mass<-cbind(center_mass,matrix(c(1:6),ncol=1)) #including the index of the clusters
> head (center_mass)
          [,1]      [,2] [,3]
[1,] -23.85075 -49.61419    1
[2,] -23.86098 -49.64558    2
[3,] -23.86075 -49.61350    3
[4,] -23.86658 -49.61991    4
[5,] -23.86757 -49.63968    5
[6,] -23.89749 -49.62372    6

New scatterplot

Scatterplot considering Longitude and Latitude

vars = c("Longitude", "Latitude")

gg <- fviz_cluster(list(df, cluster = dfcluster), choose.var=vars)

gg


score 0 · Accepted Answer

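This answer shows a solution that uses the fviz_cluster() function of the factoextra package, rather than the simulated example included in my earlier answer.

We start from the data frame posted by the OP, which already contains the clusters found by hclust() and cutree():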

df <- structure(list(Latitude = c(-23.8, -23.8, -23.9, -23.9, -23.9, 
-23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, 
-23.9, -23.9, -23.9, -23.9, -23.9), Longitude = c(-49.6, -49.6, 
-49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.7, 
-49.7, -49.7, -49.7, -49.7, -49.6, -49.6, -49.6, -49.6), Waste = c(526, 
350, 526, 469, 285, 175, 175, 350, 350, 175, 350, 175, 175, 364, 
175, 175, 350, 45.5, 54.6), cluster = c(1L, 2L, 1L, 3L, 4L, 5L, 
5L, 2L, 2L, 5L, 2L, 5L, 5L, 2L, 5L, 5L, 2L, 6L, 6L)), class = "data.frame",
row.names = c(NA, -19L))

We first generate the cluster plot using fviz_cluster():

library(factoextra)

# Analysis variables (used when computing the clusters)
vars = c("Latitude", "Longitude", "Waste")

# Initial plot showing the clusters on the first 2 PCs
gg <- fviz_cluster(list(data = df, cluster = df$cluster), choose.vars=vars)
gg

This gives the initial cluster plot.

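Note that this plot differs from the one shown by the OP. The reason is that the OP's code passes the whole data frame, so the cluster variable present in df is included in the computation of the principal components on which the plot is based: all variables in the input data frame are used to generate the plot. (This conclusion was reached by reading the source code of fviz_cluster() and running it in debug mode.)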
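The effect described above can be sketched with base R alone. This is only an illustration, assuming (as this answer states) that fviz_cluster() standardizes the data and then runs prcomp() on it. The data frame is rebuilt from the values posted above, and pca_vars and pca_all are names introduced here.

```r
# Data as posted by the OP (rounded coordinates), including the cluster column
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6),
  cluster   = c(1L, 2L, 1L, 3L, 4L, 5L, 5L, 2L, 2L, 5L,
                2L, 5L, 5L, 2L, 5L, 5L, 2L, 6L, 6L)
)
vars <- c("Latitude", "Longitude", "Waste")

# PCA on the analysis variables only
pca_vars <- prcomp(df[, vars], center = TRUE, scale. = TRUE)

# PCA on the whole data frame: the cluster column becomes a fourth variable
pca_all <- prcomp(df, center = TRUE, scale. = TRUE)

# Proportion of variance explained by the first two components in each case
prop_vars <- summary(pca_vars)$importance["Proportion of Variance", 1:2]
prop_all  <- summary(pca_all)$importance["Proportion of Variance", 1:2]
print(prop_vars)
print(prop_all)
```

The second fit has four components instead of three, so its axes and explained-variance percentages are not comparable with those of a plot built from the three analysis variables.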

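We now compute the Waste-weighted center of each cluster, together with the average Waste of each cluster (needed below to add the centers to the plot). Note that the code has now been generalized to any number of clusters found.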

# Number of clusters found
n_clusters = length( unique(df$cluster) )

# Waste-weighted cluster centers
center_mass <- matrix(nrow=n_clusters, ncol=2, dimnames=list(NULL, c("Latitude", "Longitude")))
for(i in 1:n_clusters) {
  center_mass[i,] <- c(weighted.mean(subset(df,cluster==i)$Latitude,subset(df,cluster==i)$Waste),
                       weighted.mean(subset(df,cluster==i)$Longitude,subset(df,cluster==i)$Waste))
}

# We now compute the average Waste by cluster since,
# in order to add the centers to the fviz_cluster() plot
# we need the information for all three variables used
# in the clustering analysis and generation of the plot
center_mass_with_waste = cbind(center_mass, aggregate(Waste ~ cluster, mean, data=df))
head(center_mass_with_waste)

This gives:

   Latitude Longitude cluster    Waste
1 -23.85000 -49.60000       1 526.0000
2 -23.88344 -49.63377       2 352.3333
3 -23.90000 -49.60000       3 469.0000
4 -23.90000 -49.60000       4 285.0000
5 -23.90000 -49.64286       5 175.0000
6 -23.90000 -49.60000       6  50.0500

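Now comes the most interesting part: adding the weighted centers to the plot. Since the plot is drawn on the principal component axes, we need to compute the principal component coordinates of the centers.

This is done by running a principal component analysis (PCA) on the full data and applying the PCA axis rotation to the coordinates of the centers. There are at least two functions in R's stats package that can run a PCA: prcomp() and princomp(). The preferred one is prcomp(), because it uses singular value decomposition to perform the eigen-analysis and uses the usual N-1 divisor for the variance, rather than the N used by princomp(). In addition, prcomp() is the function used by fviz_cluster().

So: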

# We first scale the analysis data as we will need the center and scale information
# to properly center and scale the weighted centers for plotting
# Note that PCA is usually run on centered and scaled data
# in order to accommodate different variable scales and make variables comparable.
# In addition, this is what is done inside fviz_cluster().
X <- scale( df[,vars] )

# We run PCA on the scaled data
summary( pca <- prcomp(X, center=FALSE, scale=FALSE) )

This gives:

Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.2263 0.9509 0.7695
Proportion of Variance 0.5012 0.3014 0.1974
Cumulative Proportion  0.5012 0.8026 1.0000

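Observe that the proportions of variance explained by the first 2 PCs agree with those shown in the initial cluster plot, namely 50.1% and 30.1%, respectively.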
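The earlier remark about the divisors used by prcomp() and princomp() can be checked numerically. The sketch below is only an illustration: it uses the built-in USArrests data as a stand-in (not the OP's data), and the variable names are introduced here.

```r
# Stand-in data set shipped with R, standardized like the analysis data above
X_demo <- scale(USArrests)
n <- nrow(X_demo)

# prcomp() uses the N-1 divisor for the variances
sd_prcomp <- prcomp(X_demo, center = FALSE, scale. = FALSE)$sdev

# princomp() uses the N divisor, so its standard deviations are slightly smaller
sd_princomp <- unname(princomp(X_demo)$sdev)

# Rescaling by sqrt(N / (N - 1)) should make the two sets of values agree
print(all.equal(sd_prcomp, sd_princomp * sqrt(n / (n - 1))))
```

The difference only affects the reported standard deviations, not the directions of the components, and it shrinks as N grows.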

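We now center and scale the weighted centers, using the same centering and scaling that were applied to the full data (this is required for plotting):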

# We center and scale the weighted centers
# (based on the information stored in the attributes of X)
center_mass_with_waste_scaled = scale(center_mass_with_waste[, vars],
                                      center=attr(X, "scaled:center"),
                                      scale=attr(X, "scaled:scale"))

# We compute the PC coordinates for the centers
# (from the scaled centers, since the PCA was fitted on the scaled data X)
center_mass_with_waste_pcs = predict(pca, center_mass_with_waste_scaled)
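The projection step relies on one fact: for a PCA fitted on already-standardized data with center=FALSE and scale.=FALSE, predict() only applies the rotation, so new points have to be standardized first with the center and scale of the original data. A minimal sketch on made-up data (toy, new_points and the other names are hypothetical):

```r
set.seed(1)

# Made-up analysis data with three variables on different scales
toy <- cbind(a = rnorm(20, 10, 2), b = rnorm(20, 50, 5), c = rnorm(20, 300, 100))
X_toy <- scale(toy)
pca_toy <- prcomp(X_toy, center = FALSE, scale. = FALSE)

# New points in the ORIGINAL units: the overall mean and one arbitrary point
new_points <- rbind(colMeans(toy), c(a = 12, b = 45, c = 250))

# Standardize them with the center and scale stored by scale()
new_scaled <- scale(new_points,
                    center = attr(X_toy, "scaled:center"),
                    scale  = attr(X_toy, "scaled:scale"))

# predict() on the standardized points is just a multiplication by the rotation
pcs_predict <- predict(pca_toy, new_scaled)
pcs_manual  <- new_scaled %*% pca_toy$rotation
print(all.equal(as.vector(pcs_predict), as.vector(pcs_manual)))

# The overall mean of the data lands at the origin of the PC axes
print(round(pcs_predict[1, ], 10))
```

Passing the unstandardized points to predict() in this setup would place them far away from the data cloud, because no centering or scaling information is stored in the fitted object.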

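Finally, we add the Waste-weighted centers to the plot (as red filled points), with the Waste values as labels. Here we distinguish whether the number of analysis variables (nvars) is 2 or greater than 2, because fviz_cluster() performs PCA only when nvars > 2; when nvars = 2 it simply scales the variables.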

# And finally we add the points to the plot (as red filled points)
# distinguishing two cases, because fviz_cluster() does different things
# in each case (i.e. no PCA when nvars = 2, just scaling)
if (length(vars) > 2) {
  # fviz_cluster() performs PCA and plots the first 2 PCs
  # => use PC coordinates for the centers
  gg + geom_point(data=as.data.frame(center_mass_with_waste_pcs),
                  mapping=aes(x=PC1, y=PC2),
                  color="red", size=3) +
       geom_text(data=as.data.frame(pca$x),
                 mapping=aes(x=PC1, y=PC2, label=df$Waste),
                 size=2, hjust=-0.5)
} else {
  # fviz_cluster() does NOT perform PCA; it simply plots the standardized variables
  # => use standardized coordinates for the centers

  # Get the names of the analysis variables as expressions (used in aes() below)
  vars_expr = parse(text=vars)
  gg + geom_point(data=as.data.frame(center_mass_with_waste_scaled),
                  mapping=aes(x=eval(vars_expr[1]), y=eval(vars_expr[2])),
                  color="red", size=3) +
       geom_text(data=as.data.frame(X),
                 mapping=aes(x=eval(vars_expr[1]), y=eval(vars_expr[2]), label=df$Waste),
                 size=2, hjust=-0.5)
}

This gives (when nvars = 3):

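Note, however, that the red points essentially coincide with the original cluster centers computed by fviz_cluster(). This is because the Waste-weighted averages of Latitude and Longitude are almost identical to their respective unweighted averages. In fact, the only center that differs slightly between the two calculation methods is that of cluster 2, as can be seen by comparing the weighted and unweighted averages of each cluster (not done here).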
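The comparison mentioned above can be sketched as follows. This is an illustration only, rebuilding the data frame from the values posted by the OP; the helper objects by_cluster, weighted and unweighted are names introduced here.

```r
# Data as posted by the OP (rounded coordinates)
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6),
  cluster   = c(1L, 2L, 1L, 3L, 4L, 5L, 5L, 2L, 2L, 5L,
                2L, 5L, 5L, 2L, 5L, 5L, 2L, 6L, 6L)
)

# One data frame per cluster
by_cluster <- split(df, df$cluster)

# Waste-weighted centers (one row per cluster)
weighted <- t(sapply(by_cluster, function(d) c(
  Latitude  = weighted.mean(d$Latitude, d$Waste),
  Longitude = weighted.mean(d$Longitude, d$Waste)
)))

# Plain (unweighted) centers
unweighted <- t(sapply(by_cluster, function(d) c(
  Latitude  = mean(d$Latitude),
  Longitude = mean(d$Longitude)
)))

# Differences between the two methods, by cluster
print(round(weighted - unweighted, 5))
```

With these rounded coordinates, the weights are constant or the coordinates are identical within every cluster except cluster 2, which is why only that row shows a nonzero difference.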

score 0 · Accepted Answer

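Since the fviz_cluster() function returns a ggplot object, you should be able to add new points to the plot just as you would with a plot created by ggplot().

Here is an example using simulated data, where I only use functions from the ggplot2 package (since I don't have the factoextra package installed).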

library(ggplot2)

# Dataset with all the points (it's your df data frame)
df <- data.frame(x=1:10, y=1:10)

# Dataset with two "center" points to add to the df points (it's your center_mass matrix)
dc <- data.frame(x=c(2.5, 7.5), y=c(2.5, 7.5))

# ggplot with the initial plot of the df points (it mimics the result from fviz_cluster())
# Note that the plot is not yet shown, it's simply stored in the gg variable
gg <- ggplot() + geom_point(data=df, mapping=aes(x,y))

# Create the plot by adding the center points to the above ggplot as larger red points
gg + geom_point(data=dc, mapping=aes(x,y), color="red", size=3)

which produces:

In your case, you should:

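Replace the line:
fviz_cluster(list(data = df, cluster = clusters))
with:
gg <- fviz_cluster(list(data = df, cluster = clusters))
Convert the center_mass matrix to a data frame (simply with as.data.frame(center_mass)) before passing it to the geom_point() call in the last line of the example above, and use colnames() to give it proper column names that you can refer to in the mapping option of geom_point().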
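The conversion step might look like the sketch below. It assumes the base plot is drawn in raw Longitude/Latitude coordinates; a plot returned by fviz_cluster() uses scaled or PCA coordinates instead, so the centers would first have to be transformed accordingly. The farms data frame is a hypothetical stand-in for the plotted properties, and the center values are those printed in the question.

```r
library(ggplot2)

# Centers of mass as printed in the question
# (columns: latitude, longitude, cluster index)
center_mass <- cbind(
  c(-23.85075, -23.86098, -23.86075, -23.86658, -23.86757, -23.89749),
  c(-49.61419, -49.64558, -49.61350, -49.61991, -49.63968, -49.62372),
  1:6
)

# Convert the matrix to a data frame and name its columns
centers_df <- as.data.frame(center_mass)
colnames(centers_df) <- c("Latitude", "Longitude", "cluster")

# Hypothetical stand-in for the property locations (raw coordinates)
farms <- data.frame(Latitude  = c(-23.8, -23.8, -23.9, -23.9),
                    Longitude = c(-49.6, -49.6, -49.6, -49.7))

# Base plot stored in gg, then the centers added as larger red points
gg <- ggplot() +
  geom_point(data = farms, mapping = aes(x = Longitude, y = Latitude))
p <- gg +
  geom_point(data = centers_df, mapping = aes(x = Longitude, y = Latitude),
             color = "red", size = 3)
p
```

Naming the columns explicitly is what lets aes() refer to Longitude and Latitude in the added layer; without it the data frame would carry the default names V1, V2 and V3.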

Let me know if this works for you!
