
I have to read many CSV files (each over 120 MB). I used a for loop, but it is very, very slow. How can I read the CSVs faster?

My code:

H <- data.frame()
I <- c("051", "041", "044", "54", "V0262")   # codes to match in A_1..A_3
for (i in 201:225) {
    for (j in 1996:2007) {
        filename <- paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep = "")
        x <- read.csv(filename, stringsAsFactors = FALSE)
        temp <- x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
        H <- rbind(H, temp)   # H grows on every iteration
    }
}

Each file has a structure like this:

> str(x)
'data.frame':   417691 obs. of  37 variables:
$ YM: int  199604 199612 199612 199612 199606 199606 199609 199601 ...
$ A_TYPE: int  1 1 1 1 1 1 1 1 1 1 ...
$ HOSP: chr  "dd0516ed3e" "c53d67027e" ...
$ A_DATE: int  19960505 19970116 19970108  ...
$ C_TYPE: int  19 9 1 1 2 9 9 1 1 1 ...
$ S_NO : int  142 37974 4580 4579 833 6846 2272 667 447 211 ...
$ C_ITEM_1 : chr  "P2" "P3" "A2"...
$ C_ITEM_2 : chr  "R6" "I3" ""...
$ C_ITEM_3 : chr  "W2" "" "A2"...
$ C_ITEM_4 : chr  "Y1" "O3" ""...
$ F_TYPE: chr  "40" "02" "02" "02" ...
$ F_DATE : int  19960415 19961223 19961227  ...
$ T_END_DATE: int  NA NA NA  ...
$ ID_B : int  19630526 19630526 19630526  ...
$ ID : chr  "fff" "fac" "eab"...
$ CAR_NO : chr  "B4" "B5" "C1" "B6" ...
$ GE_KI: int  4 4 4 4 4 4 4 4 4 4 ...
$ PT_N : chr  "H10" "A10" "D10" "D10" ...
$ A_1  : chr  "0521" "7948" "A310" "A312" ...
$ A_2  : chr  "05235" "5354" "" "" ...
$ A_3  : chr  "" "" "" "" ...
$ I_O_CE: chr  "5210" "" "" "" ...
$ DR_DAY : int  0 7 3 3 0 0 3 3 3 3 ...
$ M_TYPE: int  2 0 0 0 2 2 0 0 0 0 ...

...........


3 Answers


I think the biggest performance problem here is that you grow the H object iteratively. Every time the object grows, more memory has to be allocated and the existing contents copied over, and that process takes considerable time. A simple fix is to preallocate H to the correct number of rows; if the number of rows is not known in advance, you can preallocate a generous amount and resize as needed. One common way to avoid the repeated growth is sketched below.
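
A minimal sketch of that idea, reusing the loop bounds and the filter vector I from the question: collect each filtered piece in a preallocated list and bind everything once at the end, instead of rbind-ing into H on every pass.

pieces <- vector("list", 25 * 12)   # one slot per file, allocated up front
k <- 0
for (i in 201:225) {
    for (j in 1996:2007) {
        filename <- paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep = "")
        x <- read.csv(filename, stringsAsFactors = FALSE)
        k <- k + 1
        pieces[[k]] <- x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
    }
}
H <- do.call(rbind, pieces)   # a single rbind instead of 300 incremental ones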

Alternatively, the following approach does not suffer from the problem I described above:

list_of_files <- list.files('dir_where_files_are', pattern = '\\.csv$', full.names = TRUE)
big_data_frame <- do.call('rbind', lapply(list_of_files, read.csv, stringsAsFactors = FALSE))
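
Note that this reads every row of every file into big_data_frame; the filter from the question can then be applied once at the end, for example (assuming I as defined in the question):

keep <- (big_data_frame$A_1 %in% I) | (big_data_frame$A_2 %in% I) | (big_data_frame$A_3 %in% I)
H <- big_data_frame[keep, ]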
Answered 2013-08-18T07:04:58.290

This may not be the most efficient or most elegant approach, but it is what I would do, based on some assumptions where more information is missing; in particular, I could not do any testing:

Make sure RSQLite is installed (sqldf might be an option if you have enough memory, but I personally prefer having a "real" database that I can also access with other tools).
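
If it is not installed yet, the usual one-liner takes care of it:

install.packages( "RSQLite" )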

# make sqlite available
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "hannah.sqlite" )

# create a vector with your filenames
filenames <- NULL
for (i in 201:225)
{
    for ( j in 1996:2007 )
    {
        fname <- paste( "D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep="" ) 
        filenames <- c( filenames, fname )
    }
}

# read one row to extract the DB structure, then create an empty table
x <- read.csv( filenames[1], stringsAsFactors = FALSE, nrows = 1 )
dbWriteTable( db, "all", x, row.names = FALSE )
# ALL is a reserved word in SQL, so the table name must be quoted in raw statements
dbGetQuery( db, 'DELETE FROM "all"' )

# a small table for your selection criteria (builds in flexibility for the future);
# name the column "I" so the subqueries below can reference it
I <- data.frame( I = c( "051", "041", "044", "54", "V0262" ), stringsAsFactors = FALSE )
dbWriteTable( db, "crit", I, row.names = FALSE )

# move your 300 .csv files into that table
# (you probably do that better using the sqlite CLI but more info would be needed)
for( f in filenames )
{
    x <- read.csv( f, stringsAsFactors = FALSE )
    dbWriteTable( db, "all", x, append = TRUE, row.names = FALSE )
}

# now you can extract the subset in one go
# again, "all" must be quoted because ALL is a reserved word
extract <- dbGetQuery( db, 'SELECT * FROM "all"
                       WHERE A_1 IN ( SELECT I FROM crit ) OR
                             A_2 IN ( SELECT I FROM crit ) OR
                             A_3 IN ( SELECT I FROM crit )' )

This is untested but should work (if it doesn't, tell me where it stops), and it should be faster and not run into memory problems. But again, without the real data there is no real solution!
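
One small housekeeping addition (a standard DBI call, not part of the original answer): close the connection when you are done; the hannah.sqlite file stays on disk for later sessions or other tools.

dbDisconnect( db )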

Answered 2013-08-18T08:07:11.207

You can also use the fread() function from the data.table package. It is quite fast compared to read.csv. Also, try looping only over the files returned by list.files(), as in the sketch below.
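
A minimal sketch of that suggestion, assuming the files live in D:/Hannah/CD as in the question and reusing the filter vector I from there:

library( data.table )

files <- list.files( "D:/Hannah/CD", pattern = "\\.csv$", full.names = TRUE )
I <- c( "051", "041", "044", "54", "V0262" )

# fread() each file, filter it, then stack all the pieces at once with rbindlist()
H <- rbindlist( lapply( files, function( f ) {
    x <- fread( f )
    x[ A_1 %in% I | A_2 %in% I | A_3 %in% I ]
} ) )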

Answered 2013-08-18T09:55:58.413