r - R: Looping over a large dataset (GBs) in chunks?

Question

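I have a large dataset, several GB in size, that I have to process before I can analyse it. I tried creating a connection that lets me loop through the dataset and extract one chunk at a time, so that I can isolate the data that satisfies certain conditions.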

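My problem is that I cannot create an indicator that tells me the connection is exhausted, so that I can execute close(connector) when the end of the dataset is reached. Moreover, for the first chunk of extracted data I have to skip 17 lines, because the file contains a header that R is not able to read.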

A manual attempt that works:

filename="nameoffile.txt"    
con<<-file(description=filename,open="r")    
data<-read.table(con,nrows=1000,skip=17,header=FALSE)    
data<-read.table(con,nrows=1000,skip=0,header=FALSE)    
.    
.    
.    
till end of dataset

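Since I want to avoid typing the above command by hand until I reach the end of the dataset, I tried to write a loop to automate the process, without success.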

My failed attempt at a loop:

filename="nameoffile.txt"    
con<<-file(description=filename,open="r")    
data<-read.table(con,nrows=1000,skip=17,header=FALSE)        
if (nrow(rval)==0) {    
  con <<-NULL    
  close(con)    
  }else{    
    if(nrow(rval)!=0){    
    con <<-file(description=filename, open="r")    
    data<-read.table(conn,nrows=1000,skip=0,header=FALSE)      
  }}

score 10 · Accepted Answer

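It looks like you're on the right track. Open the connection just once (you don't need <<-, plain <- is enough) and use a larger chunk size, so that R's vectorized operations can process each chunk efficiently. Something along the lines of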

filename <- "nameoffile.txt"
nrows <- 1000000
con <- file(description=filename,open="r")    
## N.B.: skip = 17 from original prob.! Usually not needed (thx @Moody_Mudskipper)
data <- read.table(con, nrows=nrows, skip=17, header=FALSE)
repeat {
    if (nrow(data) == 0)
        break
    ## process chunk 'data' here, then...
    ## ...read next chunk
    if (nrow(data) != nrows)   # last chunk was final chunk
        break
    data <- tryCatch({
        read.table(con, nrows=nrows, skip=0, header=FALSE)
    }, error=function(err) {
       ## matching condition message only works when message is not translated
       if (identical(conditionMessage(err), "no lines available in input"))
          data.frame()
       else stop(err)
    })
}
close(con)

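Iteration seems to me like a good strategy, especially for a file that you are going to process once rather than reference repeatedly, as you would a database. The answer has been modified to try to make detecting a read at the end of the file more robust.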
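As a variation on the loop above, here is a minimal sketch that avoids matching the error message. It assumes that readLines() returns a zero-length character vector at the end of the file and that each chunk can be parsed with read.table(text = ...). The helper name filter_in_chunks and its arguments keep, chunk_size and skip are hypothetical and do not come from the original answer. The sketch also assumes that the rows you keep fit in memory and that no chunk consists solely of blank lines.

```r
## Hypothetical helper (not part of the original answer): read `filename`
## in chunks of `chunk_size` lines and keep the rows for which
## `keep(chunk)` is TRUE.
filter_in_chunks <- function(filename, keep, chunk_size = 1000000L, skip = 0L) {
    con <- file(description = filename, open = "r")
    on.exit(close(con))              # close the connection even on error
    if (skip > 0L)
        readLines(con, n = skip)     # discard the unreadable header once
    kept <- list()
    repeat {
        lines <- readLines(con, n = chunk_size)
        if (length(lines) == 0L)     # character(0) signals end of file
            break
        chunk <- read.table(text = lines, header = FALSE)
        ## vectorized filter on the whole chunk
        kept[[length(kept) + 1L]] <- chunk[keep(chunk), , drop = FALSE]
    }
    do.call(rbind, kept)             # NULL if nothing was read
}

## Demo on a small temporary file: 17 header lines, then 25 data rows.
tmp <- tempfile(fileext = ".txt")
writeLines(c(rep("# header", 17), paste(1:25, (1:25) %% 2)), tmp)
result <- filter_in_chunks(tmp, keep = function(d) d$V2 == 1,
                           chunk_size = 10L, skip = 17L)
unlink(tmp)
```

Because the header is consumed once before the loop, every later read starts at the next unread line and no skip argument is needed inside the loop. on.exit() takes care of closing the connection, so no separate end-of-file indicator is required.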
