r - PCA and cluster analysis, very slow computation

Question

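My data has 30,000 rows and 140 columns, and I am trying to cluster it. I run a PCA and then use about 12 principal components for the cluster analysis. I took a random sample of 3,000 observations, and running the PCA and hierarchical clustering on that sample took 44 minutes.

A colleague did the same thing in SPSS and it took significantly less time. Any idea why?

Here is a simplified version of my code. It works fine but gets really slow beyond about 2,000 observations. I have included the very small USArrests data set, so it does not really represent my problem, but it shows what I am trying to do. I hesitated to post a large data set because that seemed rude.

I am not sure how to speed up the clustering. I know I could take a random sample of the data and then use a predict function to assign clusters to the remaining data. Ideally, though, I would like to use all of the data in the clustering, because the data is static and will never change or be updated.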

library(factoextra)
library(FactoMineR)       
library(FactoInvestigate) 

## Data

# mydata = My data has 32,000 rows with 139 variables.
# example data with small data set 
data("USArrests")
mydata <- USArrests

## Initial PCA on mydata

res.pca <- PCA(mydata, ncp=4, scale.unit=TRUE, graph = TRUE)

Investigate(res.pca)  # this report is very helpful!  I determined to keep 12 PC and start with 3 clusters.

## Keep PCA dataset with only 2 PC
res.pca1 <- PCA(mydata, ncp=2, scale.unit=TRUE, graph = TRUE)

## Run a HC on the PC:  Start with suggested number of PC and Clusters 
res.hcpc <- HCPC(res.pca1, nb.clust=4, graph = FALSE)

## Dendrogram
fviz_dend(res.hcpc,
          cex = 0.7, 
          palette = "jco",
          rect = TRUE, rect_fill = TRUE, 
          rect_border = "jco", 
          labels_track_height = 0.8 
)

## Cluster Viz
fviz_cluster(res.hcpc,
             geom = "point",  
             ellipse.type = "convex", 
             #repel = TRUE, 
             show.clust.cent = TRUE, 
             palette = "jco", 
             ggtheme = theme_minimal(),
             main = "Factor map"
)


#### Cluster 1: Means of Variables
res.hcpc$desc.var$quanti$'1'

#### Cluster 2: Means of Variables
res.hcpc$desc.var$quanti$'2'

#### Cluster 3: Means of Variables
res.hcpc$desc.var$quanti$'3'

#### Cluster 4: Means of Variables
res.hcpc$desc.var$quanti$'4'

#### Number of Observations in each cluster
cluster_hd = res.hcpc$data.clust$clust
summary(cluster_hd)

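Any idea why SPSS is so much faster?

Any idea how to speed this up? I know clustering is computationally intensive, but I am not sure where the limit of practicality lies for my data of 30,000 records and 140 variables.

Are some other clustering packages more efficient? Any suggestions?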

score 1 · Accepted Answer

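HCPC is a hierarchical clustering on principal components using the Ward criterion. You can use the k-means algorithm for the clustering part instead, which is much faster: hierarchical clustering has a time complexity of O(n³) in its general form, whereas k-means is O(n) per iteration for a fixed number of clusters and dimensions, where n is the number of observations.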
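To get a feel for why the sample size matters so much, here is a rough back-of-the-envelope calculation in base R. It assumes a standard agglomerative implementation such as `stats::hclust`, which works from a full pairwise distance object built by `dist()` with 8 bytes per stored distance. Whether HCPC allocates memory in exactly this way is not confirmed here. The helper `n_pairs` is introduced only for this illustration.

```r
# Number of pairwise distances stored for n observations (lower triangle only)
n_pairs <- function(n) n * (n - 1) / 2

# Sizes taken from the question, plus 300 k-means centers for comparison
n_obs <- c(sample = 3000, full = 30000, centers = 300)

pairs <- n_pairs(n_obs)
gib   <- pairs * 8 / 2^30   # 8 bytes per double, converted to GiB

print(data.frame(n = n_obs, pairs = pairs, GiB = round(gib, 3)))
```

Going from 3,000 to 30,000 observations multiplies the number of stored distances by roughly 100, before any cubic-time effect of the merging step is counted.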

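Since the criterion optimized by k-means and by hierarchical clustering with Ward is the same (minimizing the total within-cluster variance), you can first run k-means with a large number of clusters (for example 300) and then, if you need to keep the hierarchical aspect, run a hierarchical clustering on the centers of those clusters.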
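A minimal sketch of this two-step approach, using only base R (`prcomp`, `kmeans`, `hclust`, `cutree`) rather than FactoMineR. The synthetic matrix `X` is a stand-in for the real data, and the choices of 12 components, 300 centers and 4 final clusters are illustrative values taken from the question and answer, not recommendations. Note that this sketch treats the 300 centers as equally weighted points, which only approximates Ward clustering on the full data.

```r
set.seed(42)

# Synthetic stand-in for the real data (assumption): 5000 rows, 20 numeric columns
X <- matrix(rnorm(5000 * 20), nrow = 5000, ncol = 20)

# PCA on standardized variables; keep the first 12 principal components
pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:12]

# Step 1: k-means with a large number of small clusters (cost grows linearly in n)
km <- kmeans(scores, centers = 300, iter.max = 50)

# Step 2: Ward hierarchical clustering on the 300 centers only
hc <- hclust(dist(km$centers), method = "ward.D2")

# Cut the tree into 4 groups, then map each observation to the group of its center
center_group <- cutree(hc, k = 4)
final        <- center_group[km$cluster]

table(final)
```

The dendrogram in `hc` can still be plotted and cut at different heights, so the hierarchical view is kept while the expensive step only sees 300 points. The FactoMineR documentation for `HCPC` also describes a `kk` argument for a k-means preprocessing step, which may achieve the same thing inside the original workflow; check `?HCPC` for your installed version.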
