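I am trying to use k-means clustering to select the most diverse markers in my population. For example, if we want to select 100 rows, I cluster the whole population into 100 clusters and then select, from each cluster, the marker closest to the centroid.
The problem with my solution is that it takes too much time (my function probably needs optimizing), especially when the number of markers exceeds 100,000.
So I would be very grateful if anyone could show me a new way of selecting markers that maximizes the diversity in my population, and/or help me optimize my function so that it runs faster.
Thanks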
# example:
library(BLR)
data(wheat)
dim(X)
mdf<-mostdiff(t(X), 100,1,nstart=1000)
Here is the mostdiff function I used:
mostdiff <- function(markers, nClust, nMrkPerClust, nstart=1000) {
  transposedMarkers <- as.array(markers)
  mrkClust <- kmeans(transposedMarkers, nClust, nstart=nstart)
  save(mrkClust, file="markerCluster.Rdata")

  # within clusters, pick the markers that are closest to the cluster centroid
  # turn the vector of which markers belong to which clusters into a list nClust long
  # each element of the list is a vector of the markers in that cluster
  clustersToList <- function(nClust, clusters) {
    vecOfCluster <- function(whichClust, clusters) {
      return(which(whichClust == clusters))
    }
    # lapply always returns a list; apply() would return a matrix if all clusters had the same size
    return(lapply(1:nClust, vecOfCluster, clusters))
  }

  pickCloseToCenter <- function(vecOfCluster, whichClust, transposedMarkers, centers, pickHowMany) {
    clustSize <- length(vecOfCluster)
    nPick <- seq_len(min(pickHowMany, clustSize))
    # if there are fewer than three markers, the center is equally distant from all so don't bother
    if (clustSize < 3) return(vecOfCluster[nPick])
    # figure out the distance (squared) between each marker in the cluster and the cluster center
    distToCenter <- function(marker, center) {
      diff <- center - marker
      return(sum(diff*diff))
    }
    dists <- apply(transposedMarkers[vecOfCluster,], 1, distToCenter, center=centers[whichClust,])
    return(vecOfCluster[order(dists)[nPick]])
  }

  # assumed final step (not shown in the post): apply the picker to every cluster
  # and return the row indices of the selected markers
  clusterList <- clustersToList(nClust, mrkClust$cluster)
  picked <- mapply(pickCloseToCenter, clusterList, 1:nClust,
                   MoreArgs=list(transposedMarkers=transposedMarkers,
                                 centers=mrkClust$centers,
                                 pickHowMany=nMrkPerClust),
                   SIMPLIFY=FALSE)
  return(unlist(picked))
}
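
The distance-to-centroid step in the function above could probably also be done in a single vectorized pass instead of one apply() call per cluster. The following is only a minimal sketch under some assumptions: `pickNearestToCentroid` is a hypothetical helper name (not part of the original post or of any package), it expects markers in rows, and it takes an already fitted `kmeans` object. It has not been timed against the original function.

```r
# Hypothetical helper (not from the original post): given a marker matrix
# (markers in rows) and a fitted kmeans object, return the row indices of the
# nPerClust markers closest to the centroid of each cluster.
pickNearestToCentroid <- function(markers, fit, nPerClust = 1) {
  markers <- as.matrix(markers)
  # squared Euclidean distance from every row to the centroid of its own cluster
  d2 <- rowSums((markers - fit$centers[fit$cluster, , drop = FALSE])^2)
  # group row indices by cluster, then keep the smallest distances in each group
  byClust <- split(seq_len(nrow(markers)), fit$cluster)
  picked <- lapply(byClust, function(idx) {
    idx[order(d2[idx])][seq_len(min(nPerClust, length(idx)))]
  })
  unlist(picked, use.names = FALSE)
}

# Toy data (an assumption for illustration): 200 "markers" scored on 10 individuals
set.seed(1)
toy <- matrix(rnorm(200 * 10), nrow = 200)
fit <- kmeans(toy, centers = 5, nstart = 10)
sel <- pickNearestToCentroid(toy, fit, nPerClust = 1)
length(sel)  # one selected row index per cluster
```

Note that this only replaces the selection step; the kmeans() call itself, repeated nstart times, is left as it is.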