r - What is the best way to select the most different individuals from a population?

Question

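I have tried using k-means clustering to select the most diverse markers in my population. For example, to select 100 lines, I cluster the whole population into 100 clusters and then pick, from each cluster, the marker closest to the centroid.

The problem with my solution is that it takes too much time (my function probably needs optimizing), especially when the number of markers exceeds 100000.

So I would really appreciate it if anyone could show me a new way to select markers that maximizes the diversity in my population, and/or help me optimize my function so that it runs faster.

Thanks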

# example:

library(BLR)
data(wheat)
dim(X)
mdf<-mostdiff(t(X), 100,1,nstart=1000)

Here is the mostdiff function I used:

mostdiff <- function(markers, nClust, nMrkPerClust, nstart=1000) {
    transposedMarkers <- as.array(markers)
    mrkClust <- kmeans(transposedMarkers, nClust, nstart=nstart)
    save(mrkClust, file="markerCluster.Rdata")

    # within clusters, pick the markers that are closest to the cluster centroid
    # turn the vector of which markers belong to which clusters into a list nClust long
    # each element of the list is a vector of the markers in that cluster

    clustersToList <- function(nClust, clusters) {
        vecOfCluster <- function(whichClust, clusters) {
            return(which(whichClust == clusters))
        }
        # lapply always returns a list, even when all clusters happen to have the same size
        return(lapply(seq_len(nClust), vecOfCluster, clusters))
    }

    pickCloseToCenter <- function(vecOfCluster, whichClust, transposedMarkers, centers, pickHowMany) {
        clustSize <- length(vecOfCluster)
        # if there are fewer than three markers, the center is equally distant from all so don't bother
        if (clustSize < 3) return(vecOfCluster[1:min(pickHowMany, clustSize)])

        # figure out the distance (squared) between each marker in the cluster and the cluster center
        distToCenter <- function(marker, center){
            diff <- center - marker    
            return(sum(diff*diff))
        }

        dists <- apply(transposedMarkers[vecOfCluster,], 1, distToCenter, center=centers[whichClust,])
        return(vecOfCluster[order(dists)[1:min(pickHowMany, clustSize)]]) 
    }
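
    # NOTE: the function as posted stops after defining the helpers; the lines
    # below are an assumed completion that applies them and returns the picks
    clusterList <- clustersToList(nClust, mrkClust$cluster)
    picked <- lapply(seq_len(nClust), function(k) {
        pickCloseToCenter(clusterList[[k]], k, transposedMarkers, mrkClust$centers, nMrkPerClust)
    })
    return(sort(unlist(picked)))
}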

score 1 · Accepted Answer

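You could try something like the following, although I think the slowest part of your code is actually kmeans. For large datasets, you could consider reducing the nstart parameter, or working on a subset, depending on the shape of your data.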

library(plyr)

markers <- data.frame(x=rnorm(1e6), y=rnorm(1e6), z=rnorm(1e6))

mostdiff <- function(markers, iter.max=1e5) {
    ncols <- ncol(markers)

    km <- kmeans(markers, 100, iter.max=iter.max)

    markers$cluster <- km$cluster
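    # squared Euclidean distance from each row to its own cluster centre
    # (the trailing comma selects whole rows of the centre matrix)
    markers$d <- rowSums(
        (as.matrix(markers[,1:ncols]) - km$centers[markers$cluster, ])^2
    )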

    result <- subset(
        merge(
            ddply(markers, ~cluster, summarise, d=min(d)),
            markers,
            all.x=TRUE, all.y=FALSE
        ),
        select=-c(d, cluster)
    )

    return(result)
}

mostdiff(markers, iter.max=100)

score 1 · Accepted Answer

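If kmeans is the most expensive part, you could apply the k-means algorithm to a random subset of your population. As long as the random subset is still large compared with the number of centroids you choose, you will get nearly the same results. Alternatively, you could run kmeans on several subsets and merge the results.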
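The random-subset idea could be sketched roughly as follows. This is only an illustration on simulated data: the `markers` matrix, `n_clust`, and `sub_size` are assumed names and values, not taken from the question, and the subset size would need tuning for real marker data.

```r
set.seed(1)  # reproducible toy data, not the wheat markers
markers <- matrix(rnorm(20000 * 3), ncol = 3)  # 20,000 hypothetical markers

n_clust  <- 100
sub_size <- 5000  # assumed subset size, still large relative to n_clust

# cluster only a random subset of the rows
sub_idx <- sample(nrow(markers), sub_size)
km <- kmeans(markers[sub_idx, ], centers = n_clust, iter.max = 100)

# squared distance from each subset row to its own cluster centre
d2 <- rowSums((markers[sub_idx, ] - km$centers[km$cluster, ])^2)

# per cluster, keep the original row index of the member closest to the centre
picked <- unname(sapply(
    split(seq_along(d2), km$cluster),
    function(i) sub_idx[i[which.min(d2[i])]]
))

length(picked)
```

Only `sub_size` rows ever reach `kmeans`, so the cost depends on the subset rather than on the full population. The selected rows are restricted to that subset.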

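Another option is to try the k-medoids algorithm, which returns centres that are themselves members of the population, so the second step (finding the member of each cluster closest to its centroid) is not needed. It may be slower than k-means, though.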

score 0 · Accepted Answer

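If you are looking for outliers in the population, and not necessarily the "markers" used to identify them, I would suggest using the Mahalanobis distance. It is usually the go-to first-line tool for outlier identification.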

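k <- 1000 # Number of outliers from the population we want
n <- nrow(x) # x: numeric matrix, one row per individual
ma.dist <- mahalanobis(x, colMeans(x), cov(x))
ix <- order(ma.dist) # row indices, smallest distance first
mdf <- x[ix[(n - k + 1):n], ] # the k rows farthest from the centre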

score 0 · Accepted Answer

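In case anyone else is trying to do the same thing, here is an answer based on damienfrancois's suggestion. Besides working on the raw data, pam (k-medoids) also accepts our own distance matrix, which matters in some cases because we have many missing values in the marker data.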

library(BLR)

data(wheat)

library(cluster)

pam_out<-pam(t(X),100)

selec.markers<-as.data.frame(colnames(X)[pam_out$id.med])
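The call above passes the raw data to pam. As a minimal sketch of the distance-matrix route mentioned in this answer, the example below builds a toy marker matrix with missing values and hands pam a precomputed `dist` object. `X_toy`, its dimensions, and the choice of Euclidean `dist()` are assumptions for illustration only; a genetics-specific distance may be more appropriate for real data.

```r
library(cluster)

set.seed(1)
# hypothetical 0/1 marker matrix: 50 lines (rows) x 200 markers (columns)
X_toy <- matrix(rbinom(50 * 200, 1, 0.5), nrow = 50,
                dimnames = list(NULL, paste0("m", 1:200)))
X_toy[sample(length(X_toy), 500)] <- NA  # sprinkle in missing genotypes

# dist() leaves out NA entries pair by pair and rescales the sum accordingly
d <- dist(t(X_toy))   # markers as rows
stopifnot(!anyNA(d))  # pam needs a complete dissimilarity matrix

pam_out  <- pam(d, k = 10)  # pam treats a "dist" object as dissimilarities
selected <- colnames(X_toy)[pam_out$id.med]
selected
```

Because the medoids are actual markers, `id.med` indexes directly into the marker names with no separate nearest-to-centroid step.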
