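I was wondering if anyone could look at the following code and minimal example and suggest improvements, particularly regarding the efficiency of the code when working with really large data sets.
The function takes a data.frame, splits it by a grouping variable (factor), and then calculates the distance matrix for all the rows in each group.
I do not need to keep the distance matrices, only some statistics such as the mean and the histogram, after which they can be discarded.
I know very little about memory allocation and the like, and I am wondering what the best way to do this would be, since I will be working with 10,000 to 100,000 cases per group. Any thoughts would be greatly appreciated!
Also, if I run into serious memory issues, what would be the least painful way of including bigmemory or some other package for handling large data in the function?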
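To put those group sizes in perspective, here is a rough back-of-the-envelope estimate of how much memory a single `dist` object would need. It assumes that `dist` stores one 8-byte double per unordered pair of rows, i.e. n(n-1)/2 values, and ignores any temporary copies R may make; the helper name `dist_size_gb` is made up for this illustration.

```r
# Approximate size (in GB, 1 GB = 1e9 bytes) of a dist object for n rows,
# assuming one 8-byte double per unordered pair of rows.
dist_size_gb <- function(n) {
  n_pairs <- n * (n - 1) / 2
  n_pairs * 8 / 1e9
}

# Roughly 0.4 GB for 10,000 rows and roughly 40 GB for 100,000 rows
dist_size_gb(c(1e4, 1e5))
```

So under these assumptions the lower end of the range is probably manageable, while the upper end would not fit in memory on most machines.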
FactorDistances <- function(df) {
# df is the data frame where the first column is the grouping variable.
# find names and number of groups in df (in the example there are three: 2, 3 and 4)
factor.names <- unique(df[1])
n.factors <-length(unique(df$factor))
# split df by factor into list - each subset dataframe is one list element
df.l<-list()
for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]}
# use lapply to go through list and calculate distance matrix for each group
# this results in a new list where each element is a distance matrix
distances <- lapply (df.l, function(x) dist(x[,2:length(x)], method="minkowski", p=2))
# again use lapply to get the mean distance for each group
means <- lapply (distances, mean)
rm(distances)
gc()
return(means)
}
df <- data.frame(cbind(factor=rep(2:4,2:4), rnorm(9), rnorm(9)))
FactorDistances(df)
# The result is three average Euclidean distances between all pairs in each group
# If a group has only one member, the value is NaN
Edit: I edited the title to reflect the chunking issue, which I posted as an answer.
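For illustration, one way the chunking idea could look is sketched below. This is only a minimal sketch under my own assumptions, not necessarily the approach from the posted answer: the function name `chunked_mean_dist` and the `chunk_size` parameter are hypothetical, it handles Euclidean distance only, and it accumulates just the mean. It never builds the full distance matrix; at any time it holds a block of `chunk_size` by n distances, so `chunk_size` should be chosen with the group size in mind.

```r
# Mean pairwise Euclidean distance between the rows of x, computed in
# row blocks so that the full n x n distance matrix is never stored.
chunked_mean_dist <- function(x, chunk_size = 1000) {
  x <- as.matrix(x)
  n <- nrow(x)
  if (n < 2) return(NaN)  # no pairs, same result as mean(dist(x))
  total <- 0
  for (start in seq(1, n, by = chunk_size)) {
    rows <- start:min(start + chunk_size - 1, n)
    # squared distances between the rows of this block and all rows
    d2 <- matrix(0, nrow = length(rows), ncol = n)
    for (k in seq_len(ncol(x))) {
      d2 <- d2 + outer(x[rows, k], x[, k], "-")^2
    }
    total <- total + sum(sqrt(d2))
  }
  # every unordered pair was counted twice and self-distances are zero
  total / (n * (n - 1))
}

# Per-group means, taking the first column as the grouping variable
df <- data.frame(cbind(factor = rep(2:4, 2:4), rnorm(9), rnorm(9)))
sapply(split(df[-1], df[[1]]), chunked_mean_dist)
```

A histogram could probably be accumulated the same way by counting the distances of each block into fixed bins, although that requires choosing the breaks in advance.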