我正在使用 rgdal 包在 R 中处理图像分类脚本。有问题的栅格是一个具有 28 个通道的 PCIDSK 文件:一个训练数据通道、一个验证数据通道和 26 个光谱数据通道。目标是填充一个数据帧,其中包含训练数据通道中不为零的每个像素的值,以及 26 个波段中的相关光谱值。
在 Python/Numpy 中,我可以轻松地将整个图像的所有波段导入多维数组,但是,由于内存限制,R 中的唯一选项似乎是逐块导入此数据,这非常慢:
library(rgdal)
raster = "image.pix"
training_band = 2
validation_band = 1
BlockWidth = 500
BlockHeight = 500
# Get some metadata about the whole raster
myinfo = GDALinfo(raster)
ysize = myinfo[[1]]
xsize = myinfo[[2]]
numbands = myinfo[[3]]
# Iterate through the image in blocks and retrieve the training data
column = 0
training_data = NULL
while(column < xsize){
if(column + BlockWidth > xsize){
BlockWidth = xsize - column
}
row = 0
while(row < ysize){
if(row + BlockHeight > ysize){
BlockHeight = ysize - row
}
# Do stuff here
myblock = readGDAL(raster, region.dim = c(BlockHeight,BlockWidth), offset = c(row, column), band = c(training_band,3:numbands), silent = TRUE)
blockdata = matrix(NA, dim(myblock)[1], dim(myblock)[2])
for(i in 1:(dim(myblock)[2])){
bandname = paste("myblock", names(myblock)[i], sep="$")
blockdata[,i]= as.matrix(eval(parse(text=bandname)))
}
blockdata = as.data.frame(blockdata)
blockdata = subset(blockdata, blockdata[,1] > 0)
if (dim(blockdata)[1] > 0){
training_data = rbind(training_data, blockdata)
}
row = row + BlockHeight
}
column = column + BlockWidth
}
remove(blockdata, myblock, BlockHeight, BlockWidth, row, column)
有没有更快/更好的方法来做同样的事情而不会耗尽内存?
收集此训练数据后的下一步是创建分类器(randomForest 包),该分类器也需要大量内存,具体取决于请求的树数。这让我想到了第二个问题,即考虑到训练数据已经占用的内存量,不可能创建一个由 500 棵树组成的森林:
myformula = formula(paste("as.factor(V1) ~ V3:V", dim(training_data)[2], sep=""))
r_tree = randomForest(formula = myformula, data = training_data, ntree = 500, keep.forest=TRUE)
有没有办法分配更多的内存?我错过了什么吗?谢谢...
[编辑] 正如 Jan 所建议的,使用“raster”包要快得多;但是据我所知,就收集训练数据而言,它并不能解决内存问题,因为它最终需要在数据帧中,在内存中:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
training_band = 2
validation_band = 1
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
training_class[training_class == 0] = NA
training_class = Which(training_class != 0, cells=TRUE)
training_data = extract(myraster, training_class)
training_data = as.data.frame(training_data)
所以虽然这要快得多(并且需要更少的代码),但它仍然不能解决没有足够的可用内存来创建分类器的问题......是否有一些我还没有找到的“光栅”包函数可以完成这个? 谢谢...