r - Monitoring R data loading progress when using read.table

Question

I have found plenty of answers for other kinds of data loading, but nothing for R when using read.table(...). I have a simple command:

data = read.table(file=filename,
                sep="\t",
                col.names=c("time","id","x","y"),
                colClasses=c("integer","NULL","NULL","NULL"))

This loads a large amount of data in about 30 seconds, but a progress bar would be really nice :-D

score 2 · Accepted Answer

Experiments continuing:

Construct a temporary working file:

n <- 1e7
dd <- data.frame(time=1:n,id=rep("a",n),x=1:n,y=1:n)
fn <- tempfile()
write.table(dd,file=fn,sep="\t",row.names=FALSE,col.names=FALSE)

Run 10 replications of read.table (with and without colClasses specified) and of scan:

Edit: corrected the scan call in response to a comment, and updated the results:

library(rbenchmark)
(b1 <- benchmark(read.table(fn,
                            col.names=c("time","id","x","y"),
                            colClasses=c("integer",
                                         "NULL","NULL","NULL")),
                 read.table(fn,
                            col.names=c("time","id","x","y")),
                 scan(fn,
                      what=list(integer(),NULL,NULL,NULL)),
                 replications=10))

Results:

2 read.table(fn, col.names = c("time", "id", "x", "y"))
1 read.table(fn, col.names = c("time", "id", "x", "y"), 
      colClasses = c("integer", "NULL", "NULL", "NULL"))
3  scan(fn, what = list(integer(), NULL, NULL, NULL))

  replications elapsed relative user.self sys.self 
2           10 278.064 1.857016   232.786   30.722    
1           10 149.737 1.011801   141.365    2.388  
3           10 143.118 1.000000   140.617    2.105

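(Warning: these values are a little bit cooked/inconsistent, since I re-ran the benchmark and merged the results ... but the qualitative result should be OK.)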

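read.table without colClasses is the slowest (that's not surprising), but only (?) about 85% slower than scan for this example. scan is only a tiny bit faster than read.table with colClasses specified.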

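With either scan or read.table, one could write a "chunked" version that uses the skip and nrows (read.table) or n (scan) arguments to read the file a piece at a time, then pastes the pieces together at the end. I don't know how much this would slow the process down, but it would allow calls to txtProgressBar between chunks ...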
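
A minimal sketch of that chunked idea, under some assumptions that are not in the answer above: the helper name read_table_with_progress and its arguments total_rows and chunk_rows are hypothetical, and the total row count is assumed to be known in advance (otherwise it would need a separate counting pass). Instead of skip, the sketch keeps one connection open so that each read.table call resumes where the previous one stopped. Calling read.table with skip on a file path would re-read the skipped lines on every chunk.

```r
# Hypothetical helper (not part of the original answer): read a delimited
# file in chunks and update a text progress bar after each chunk.
read_table_with_progress <- function(path, total_rows, chunk_rows = 1e5, ...) {
  con <- file(path, open = "r")   # open once; successive reads continue from here
  on.exit(close(con), add = TRUE)

  pb <- txtProgressBar(min = 0, max = total_rows, style = 3)
  on.exit(close(pb), add = TRUE)

  chunks <- list()
  rows_read <- 0
  repeat {
    # At end of input read.table either signals an error or returns zero
    # rows, depending on the arguments; both are treated as "done".
    # Caveat: this also hides genuine parse errors, which is acceptable for
    # a sketch but not for production code.
    chunk <- tryCatch(
      read.table(con, nrows = chunk_rows, ...),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break

    chunks[[length(chunks) + 1]] <- chunk
    rows_read <- rows_read + nrow(chunk)
    setTxtProgressBar(pb, min(rows_read, total_rows))
  }

  # Paste the pieces together at the end
  do.call(rbind, chunks)
}

# Usage on a small temporary file with the same layout as the test file above
n_demo <- 1000
fn_demo <- tempfile()
write.table(data.frame(time = 1:n_demo, id = rep("a", n_demo),
                       x = 1:n_demo, y = 1:n_demo),
            file = fn_demo, sep = "\t", row.names = FALSE, col.names = FALSE)

demo <- read_table_with_progress(fn_demo, total_rows = n_demo, chunk_rows = 300,
                                 sep = "\t",
                                 col.names = c("time", "id", "x", "y"),
                                 colClasses = c("integer", "NULL", "NULL", "NULL"))
```

The chunk size trades progress granularity against overhead: smaller chunks update the bar more often but mean more read.table calls and a larger rbind at the end. How much this slows things down relative to a single read.table call is not measured here.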
