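I have a bunch of locations for each of roughly 1,000 people. The full dataset used to contain about 2.5 million observations, and my processing script took about 20 hours to run. Now, however, I have 24 million observations, so I think I need to clean up my code and, if possible, use parallel processing.
I have been using the gDistance function from the rgeos package to compute the distance from each point to each polygon, and I have been running a series of loops (I know, I know) to break the processing up by person. I have spent a lot of time trying to work out how to move this into plyr/dplyr syntax, but I can't quite get there. I think part of the problem is the class of my objects: SpatialPointsDataFrame and SpatialPolygonsDataFrame.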
library(sp)     # Spatial* classes, coordinates()
library(rgeos)  # gDistance()
library(plyr)   # dlply()
# Create SpatialPointsDataFrame
# My actual dataset has 24 million observations
my.pts <- data.frame(LONGITUDE=c(-85.4,-84.7,-82.7,-82.7,-86.5,-88.9,-94.8,-83.9,-87.8,-82.8),
coordinates(my.pts) <- c("LONGITUDE","LATITUDE")
# Create two polygons in a SpatialPolygonsDataFrame
# My actual dataset has 71 polygons (U.S. counties)
x1 <- data.frame(x=c(-92.3, -92.3, -90.7, -90.7, -92.3, -92.3),y=c(27.6, 29.4, 29.4, 27.6, 27.6, 27.6))
x1 <- as.data.frame(x1)
x1 <- Polygon(rbind(x1,x1[1,]))
x2 <- data.frame(x=c(-85.2, -85.2, -83.3, -83.2, -85.2, -85.2),y=c(26.4, 26.9, 26.9, 26.0, 26.4, 26.4))
x2 <- as.data.frame(x2)
x2 <- Polygon(rbind(x2,x2[1,]))
poly1 <- Polygons(list(x1),"poly1")
poly2 <- Polygons(list(x2),"poly2")
myShp <- SpatialPolygons(list(poly1,poly2),1:2)
sdf <- data.frame(ID=c(1,2))
row.names(sdf) <- c("poly1","poly2")
myShp <- SpatialPolygonsDataFrame(myShp,data=sdf)
# I have been outputting my results to a list. With this small sample, it's easy
# to put everything into the single object county.vec, but I worry that a
# 24 million x 71 matrix would not be feasible. The non-loop version below
# shows most simply the output I've been getting.
COUNTY.LIST <- list()
county.vec <- gDistance(my.pts, myShp, byid=TRUE)
COUNTY.LIST[[1]] = apply(county.vec, 2, min)
COUNTY.LIST[[2]] = apply(county.vec, 2, which.min)
COUNTY.LIST[[3]] = my.pts$INDEX
# I have been putting it into a loop so that county.vec is overwritten on each iteration.
# It seems like this could be done with dlply, perhaps? Would that give me parallel processing as well?
idx <- unique(my.pts$MYID)
COUNTY.LIST <- list()
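for(i in seq_along(idx)){
  COUNTY.LIST[[i]] <- list()
  # Rows of county.vec are polygons, columns are this person's points
  county.vec <- gDistance(my.pts[my.pts$MYID==idx[i],], myShp, byid=TRUE)
  COUNTY.LIST[[i]][[1]] = apply(county.vec, 2, min)        # distance to the nearest polygon
  COUNTY.LIST[[i]][[2]] = apply(county.vec, 2, which.min)  # row index of the nearest polygon
  COUNTY.LIST[[i]][[3]] = my.pts$INDEX[my.pts$MYID==idx[i]]
}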
dlply(my.pts,.(MYID),gDistance(my.pts, myShp, byid=TRUE),.parallel=TRUE)
> dlply(my.pts,.(MYID),gDistance(my.pts, myShp, byid=TRUE))
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
# I suspect this error is because my.pts is a SpatialPointsDataFrame rather than a plain data.frame. I also recognize that my function call probably isn't right, but first things first.
# I tried another way to reference the MYID field, more in line with the treatment of S4 objects...
dlply(my.pts,my.pts@data$MYID,gDistance(my.pts, myShp, byid=TRUE),.parallel=TRUE)
# It yields the same error.
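
For reference, here is a rough, untested-at-scale sketch of one direction the per-person loop could take without plyr: split the row indices by MYID (rather than handing the S4 object to dlply) and apply a helper to each group, so that only one person's distance matrix exists at a time. The point coordinates, the `unit_square` and `nearest_for_rows` helpers, and the object names are all made up for illustration; the sketch assumes the sp and rgeos packages are installed.

```r
library(sp)     # assumed installed
library(rgeos)  # assumed installed; provides gDistance()

# Illustrative points only: these coordinates and IDs are made up
pts <- data.frame(LONGITUDE = c(-85.4, -84.7, -82.7, -88.9),
                  LATITUDE  = c(30.0, 29.9, 27.5, 26.1),
                  MYID      = c(1, 1, 2, 3),
                  INDEX     = 1:4)
coordinates(pts) <- c("LONGITUDE", "LATITUDE")

# Hypothetical helper: a unit square with lower-left corner (x0, y0)
unit_square <- function(x0, y0, id) {
  xy <- cbind(c(x0, x0, x0 + 1, x0 + 1, x0),
              c(y0, y0 + 1, y0 + 1, y0, y0))
  Polygons(list(Polygon(xy)), id)
}
polys <- SpatialPolygons(list(unit_square(-92, 28, "poly1"),
                              unit_square(-85, 26, "poly2")))

# Hypothetical helper: nearest polygon for the points in the given rows
nearest_for_rows <- function(rows, pts, polys) {
  # Distance matrix: one row per polygon, one column per point in this group
  d <- gDistance(pts[rows, ], polys, byid = TRUE)
  data.frame(INDEX    = pts$INDEX[rows],
             nearest  = apply(d, 2, which.min),  # row index of the closest polygon
             distance = apply(d, 2, min))        # distance to that polygon
}

# Split row indices by person instead of splitting the Spatial object itself
rows_by_id <- split(seq_along(pts$MYID), pts$MYID)
res_list   <- lapply(rows_by_id, nearest_for_rows, pts = pts, polys = polys)
result     <- do.call(rbind, res_list)
```

On Unix-like systems, swapping `lapply` for `parallel::mclapply(rows_by_id, nearest_for_rows, pts = pts, polys = polys, mc.cores = 2)` might be one way to spread the groups across cores; whether that actually helps with 24 million points is something I have not verified.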