Many thanks to jwijffels for steering me in the right direction, and to http://rmazing.wordpress.com/2013/02/22/bigcor-large-correlation-matrices-in-r/ for getting me started.
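Assume a 7000x180 data matrix called training.data. The goal is to create a symmetric 7000x7000 distance matrix. Strictly speaking, daisy() produces a dissimilarity measure rather than a distance, but the logic is the same.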
    distff <- function(training.data, nblocks = 5, verbose = TRUE) {
        require(ff)
        require(cluster)
        nro <- nrow(training.data)
        ### Rows are split into equal blocks, so nro must be divisible by nblocks.
        ### This could be changed to handle row counts where nro %% nblocks != 0
        stopifnot(nro %% nblocks == 0)
        ### File-backed nro x nro result; pass filename= to ff() if so desired
        ffmat <- ff(vmode = "single", dim = c(nro, nro))
        splt <- split(1:nro, rep(1:nblocks, each = nro / nblocks))
        ### All unordered pairs of blocks
        COMBS <- expand.grid(1:length(splt), 1:length(splt))
        COMBS <- t(apply(COMBS, 1, sort))
        COMBS <- unique(COMBS)
        for (i in 1:nrow(COMBS)) {
            COMB <- COMBS[i, ]
            ### Since g1 and g2 get appended below, it wouldn't make sense to append the
            ### same group to itself
            if (COMB[1] != COMB[2]) {
                if (verbose) cat("Block pair", COMB[1], "-", COMB[2], "\n")
                g1 <- splt[[COMB[1]]]
                g2 <- splt[[COMB[2]]]
                slj <- as.matrix(daisy(training.data[c(g1, g2), ], metric = "gower",
                                       stand = FALSE))
                ffmat[c(g1, g2), c(g1, g2)] <- slj
                rm(slj)
                gc()
            }
        }
        ### Return the filled ff matrix so the caller can use it
        ffmat
    }
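That's it. I realize there are some inefficiencies (such as writing several groups more than once). I'm fine with that, because it works. As I said, most of this code was borrowed and adapted from the site referenced above.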
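For anyone who wants to check the block-pair logic on something small, here is a minimal sketch in base R only. It is not the function above: it fills an ordinary in-memory matrix instead of an ff object, and it swaps daisy() for dist() with Euclidean distance so the result can be compared against a single full call. The names x, block_id, pairs and d are made up for the illustration. It also shows one possible way to split rows when the row count is not divisible by nblocks, and uses combn() to list each pair of distinct blocks once.

```r
# Sketch: the same block-pair idea on a small in-memory matrix, base R only.
set.seed(1)
x <- matrix(rnorm(23 * 4), nrow = 23)   # 23 rows: not divisible by nblocks
nblocks <- 5
nro <- nrow(x)

# Assign each row to a block; block sizes may differ by one row.
block_id <- ceiling(seq_len(nro) * nblocks / nro)
splt <- split(seq_len(nro), block_id)

# Each unordered pair of distinct blocks, listed exactly once.
pairs <- t(combn(length(splt), 2))

# Plain matrix standing in for the ff matrix (assumption for this sketch).
d <- matrix(NA_real_, nro, nro)
for (i in seq_len(nrow(pairs))) {
  idx <- c(splt[[pairs[i, 1]]], splt[[pairs[i, 2]]])
  d[idx, idx] <- as.matrix(dist(x[idx, , drop = FALSE]))
}

# The assembled matrix matches the one computed in a single call.
stopifnot(isTRUE(all.equal(d, as.matrix(dist(x)), check.attributes = FALSE)))
```

One caveat I am fairly sure of but have not verified on the real data: with daisy(metric = "gower"), numeric columns are scaled by their range in whatever rows are passed in, so values computed per block pair may not exactly match a single call on the full data. Euclidean dist() has no such dependence, which is why the check above passes.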