disk.frame
It looked like an interesting way to fill the gap between in-RAM processing and big data.
To test it, I created a set of 200 CSV files of roughly 200 MB each, for a total of about 40 GB, above the 32 GB of RAM installed on my computer:
library(furrr)
library(magrittr)
library(data.table)
library(dplyr)
library(disk.frame)
plan(multisession, workers = 11)
nbrOfWorkers()
# [1] 11

filelength <- 1e7

# Create 200 files of ~200 MB each
sizelist <- 1:200 %>% future_map(~{
  mydf <- data.table(week = sample(1:52, filelength, replace = TRUE),
                     list_of_id = sample(1:filelength, filelength, replace = TRUE))
  filename <- paste0('data/test', .x, '.csv')
  data.table::fwrite(mydf, filename)
  # write.csv(mydf, file = filename)
  file.size(filename)
})

sum(unlist(sizelist))
# [1] 43209467799
Since n_distinct is a dplyr verb, I first stuck with the dplyr syntax:
setup_disk.frame()
# The number of workers available for disk.frame is 6
options(future.globals.maxSize = Inf)

mydf <- csv_to_disk.frame(file.path('data', list.files('data')))
"
csv_to_disk.frame: Reading multiple input files.
Please use `colClasses = ` to set column types to minimize the chance of a failed read
=================================================
-----------------------------------------------------
-- Converting CSVs to disk.frame -- Stage 1 of 2:
Converting 200 CSVs to 60 disk.frames each consisting of 60 chunks
Progress: ──────────────────────────────────────────────────────────────── 100%
-- Converting CSVs to disk.frame -- Stage 1 of 2 took: 00:01:44 elapsed (0.130s cpu)
-----------------------------------------------------
-----------------------------------------------------
-- Converting CSVs to disk.frame -- Stage 2 of 2:
Row-binding the 60 disk.frames together to form one large disk.frame:
Creating the disk.frame at c:\TempWin\RtmpkNkY9H\file398469c42f1b.df
Appending disk.frames:
Progress: ──────────────────────────────────────────────────────────────── 100%
Stage 2 of 2 took: 59.9s elapsed (0.370s cpu)
-----------------------------------------------------
Stage 1 & 2 in total took: 00:02:44 elapsed (0.500s cpu)"
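The log's hint about colClasses is worth following: as far as I can tell, extra arguments to csv_to_disk.frame are forwarded to the underlying reader (data.table::fread with the default backend), so a re-run with explicit column types would look roughly like this (a sketch, not something I tested here):

# Hypothetical re-run: declare both column types up front so the reader
# does not have to guess them separately for each of the 200 files
mydf <- csv_to_disk.frame(file.path('data', list.files('data')),
                          colClasses = c(week = "integer",
                                         list_of_id = "integer"))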
result <- mydf %>%
  group_by(week) %>%
  summarize(value = n_distinct(list_of_id)) %>%
  collect()
result
# A tibble: 52 x 2
    week   value
   <int>   <int>
 1     1 9786175
 2     2 9786479
 3     3 9786222
 4     4 9785997
 5     5 9785833
 6     6 9786013
 7     7 9786586
 8     8 9786029
 9     9 9785674
10    10 9786314
# ... with 42 more rows
So it works! Total RAM usage for this particular task fluctuated between 1 and 5 GB, and processing the 2 billion rows on 6 processors took less than 10 minutes; the limiting factor appeared to be disk access speed rather than processor power.
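Since disk access rather than CPU looked like the bottleneck, one lever worth mentioning is disk.frame's srckeep(), which restricts the columns actually read from disk for each chunk. With only two columns here it would change little, but on wider tables it should cut I/O noticeably; a minimal sketch, assuming the same layout as above:

# srckeep() limits which columns are read from disk for each chunk;
# on tables wider than this two-column example it reduces I/O
result <- mydf %>%
  srckeep(c("week", "list_of_id")) %>%
  group_by(week) %>%
  summarize(value = n_distinct(list_of_id)) %>%
  collect()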
I also tested the data.table syntax, since disk.frame accepts both, but it returned far too quickly, with roughly 60 times too many rows (as if the 60 disk.frames created from the 200 CSVs had not been merged and/or fully processed), along with many warnings of the form Warning messages: 1: In serialize(data, node$con). A reconstruction of the attempt follows below.
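For reference, the data.table-style query looked roughly like this (a reconstruction, not my exact code); the inflated result is consistent with the expression being evaluated once per chunk and the 60 per-chunk results being row-bound without a final aggregation:

# Reconstruction: with 60 chunks, each week can appear once per chunk,
# giving about 60 * 52 rows instead of the expected 52
result_dt <- mydf[, .(value = uniqueN(list_of_id)), by = week]
nrow(result_dt)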
I filed an issue on GitHub.
Until this is clarified, I recommend sticking with the dplyr syntax, which works.
This example convinced me that disk.frame makes it possible to process larger-than-RAM data with the verbs it supports.