Continuing the experiment:
Construct a temporary work file:
n <- 1e7
dd <- data.frame(time=1:n,id=rep("a",n),x=1:n,y=1:n)
fn <- tempfile()
write.table(dd,file=fn,sep="\t",row.names=FALSE,col.names=FALSE)
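Use read.table (with and without specifying colClasses) and scan, running 10 replications of each.
Edit: corrected the scan call in response to a comment, and updated the results: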
library(rbenchmark)
(b1 <- benchmark(read.table(fn,
                            col.names=c("time","id","x","y"),
                            colClasses=c("integer",
                                         "NULL","NULL","NULL")),
                 read.table(fn,
                            col.names=c("time","id","x","y")),
                 scan(fn,
                      what=list(integer(),NULL,NULL,NULL)),replications=10))
Results:
2 read.table(fn, col.names = c("time", "id", "x", "y"))
1 read.table(fn, col.names = c("time", "id", "x", "y"), colClasses = c("integer", "NULL", "NULL", "NULL"))
3 scan(fn, what = list(integer(), NULL, NULL, NULL))
  replications elapsed relative user.self sys.self
2           10 278.064 1.857016   232.786   30.722
1           10 149.737 1.011801   141.365    2.388
3           10 143.118 1.000000   140.617    2.105
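(Warning: these values are somewhat cooked/inconsistent, since I re-ran the benchmark and merged the results ... but the qualitative result should be OK.)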
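read.table without colClasses is the slowest (which is not surprising), but only (?) about 85% slower than scan for this example. scan is only slightly faster than read.table with colClasses specified.
With either scan or read.table, one could write a "chunked" version that uses the skip and nrows (read.table) or n (scan) arguments to read bits of the file at a time, then pastes them together at the end. I don't know how much this would slow the process down, but it would allow calls to txtProgressBar between chunks ...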
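A minimal sketch of that chunked idea, assuming the total number of rows is known in advance. The helper name read_chunked and its arguments total_rows and chunk_size are hypothetical (mine, not part of base R); only read.table, txtProgressBar, setTxtProgressBar and rbind are real functions here. It has not been benchmarked.

```r
# Hypothetical helper: read a headerless file in chunks with read.table,
# updating a text progress bar after each chunk.
read_chunked <- function(file, total_rows, chunk_size,
                         col.names, colClasses = NA) {
  stopifnot(total_rows > 0, chunk_size > 0)
  n_chunks <- ceiling(total_rows / chunk_size)
  pb <- txtProgressBar(min = 0, max = n_chunks, style = 3)
  on.exit(close(pb))
  chunks <- vector("list", n_chunks)
  for (i in seq_len(n_chunks)) {
    done <- (i - 1) * chunk_size          # rows already read
    chunks[[i]] <- read.table(file,
                              skip = done,
                              nrows = min(chunk_size, total_rows - done),
                              col.names = col.names,
                              colClasses = colClasses)
    setTxtProgressBar(pb, i)
  }
  do.call(rbind, chunks)                  # paste the pieces together
}

# Usage on a small file built the same way as the one above
n_small <- 1e4
dd_small <- data.frame(time = 1:n_small, id = rep("a", n_small),
                       x = 1:n_small, y = 1:n_small)
fn_small <- tempfile()
write.table(dd_small, file = fn_small, sep = "\t",
            row.names = FALSE, col.names = FALSE)
res <- read_chunked(fn_small, total_rows = n_small, chunk_size = 2500,
                    col.names = c("time", "id", "x", "y"),
                    colClasses = c("integer", "NULL", "NULL", "NULL"))
```

Because skip is applied to the file name each time, every chunk has to pass over the lines that were already read, so the cost likely grows with the number of chunks. Reading successive chunks from one open connection may avoid that, but I have not timed either variant.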