I have several very large datasets (.csv files, 4 to 9 GB each). I load them into R with the ff and ffbase packages and compute the daily mean, sum, and max of energy-consumption values. The script works for 15 of the 19 files, but now it suddenly stops working. I still consider myself an R novice, and I am only just learning how to handle files this large.
Here is the script (found here: Aggregating with the ffdfdply function in R):
library(tidyverse)
library(ff) # to work with files 2 - 10 GB
library(ffbase)
# read the csv into a file-backed ffdf object
tab.ff <- read.csv.ffdf(file = "file.csv")
#creates a ffdf object
class(tab.ff)
str(tab.ff)
# split by date -> assuming that all data of 1 date can fit into RAM
splitby <- as.character(tab.ff$Date, by = 250000)
grp_qty <- ffdfdply(x = tab.ff[c("Date", "ODBA.Sm", "VeDBA.smoothed")],
                    split = splitby,
                    FUN = function(tab.ff) {
                      ## this happens in RAM - containing several split elements,
                      ## so here we can use data.table, which works fine for in-RAM computing
                      require(data.table)
                      tab.ff <- as.data.table(tab.ff)
                      result <- tab.ff[, list(ODBA_sum   = sum(ODBA.Sm, na.rm = TRUE),
                                              VeDBA_sum  = sum(VeDBA.smoothed, na.rm = TRUE),
                                              ODBA_mean  = mean(ODBA.Sm, na.rm = TRUE),
                                              VeDBA_mean = mean(VeDBA.smoothed, na.rm = TRUE),
                                              ODBA_max   = max(ODBA.Sm, na.rm = TRUE),
                                              VeDBA_max  = max(VeDBA.smoothed, na.rm = TRUE)),
                                       by = list(Date)]
                      as.data.frame(result)
                    })
dim(grp_qty)
grp_qty # look at it
# export as csv file (under a new name, so the multi-GB input is not overwritten)
write.csv.ffdf(grp_qty, file = "file_daily.csv")
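One sanity check I can think of: if the Date column contains NAs or empty strings, the split sizes computed by ffdfdply could be distorted. A minimal base-R sketch of that check, on made-up data (with the real file I would first pull a sample, e.g. read.csv("file.csv", nrows = 1e6)):

```r
# made-up sample standing in for a slice of the real file
sample_df <- data.frame(
  Date    = c("2021-01-01", "2021-01-01", NA, "2021-01-02", ""),
  ODBA.Sm = c(1.2, 0.8, 2.5, 1.1, 0.4)
)

splitby <- as.character(sample_df$Date)

# count split values that would end up in a bogus group
n_na    <- sum(is.na(splitby))
n_empty <- sum(!is.na(splitby) & splitby == "")

# group sizes, with NA shown as its own group
grp_sizes <- table(splitby, useNA = "ifany")

n_na     # 1
n_empty  # 1
grp_sizes
```

If the four failing files show NAs or empty strings here, cleaning Date before building splitby would be the first thing to try.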
So, as I said, it works for 15 files, but for four files ffdfdply throws the following error:
2021-11-02 17:53:05, calculating split sizes
Error in grouprunningcumsum(x = as.integer(splitgroups$tab), max = MAXSIZE) :
NAs in foreign function call (arg 3)
In addition: Warning message:
In grouprunningcumsum(x = as.integer(splitgroups$tab), max = MAXSIZE) :
NAs introduced by coercion to integer range
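If I read the warning correctly, "NAs introduced by coercion to integer range" is what base R emits when as.integer() receives a value beyond .Machine$integer.max, so my guess (an assumption on my part, not something I have verified in the ffbase source) is that some split-size count overflows the 32-bit integer range for these four files. The coercion behavior itself is easy to reproduce:

```r
# as.integer() returns NA for values beyond the 32-bit integer range,
# with exactly the warning shown above
big <- 3e9                        # larger than .Machine$integer.max (2147483647)
res <- suppressWarnings(as.integer(big))
is.na(res)                  # TRUE
big > .Machine$integer.max  # TRUE
```

If that is the cause, a finer split key (so no single group gets enormous) might avoid it.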
If anyone knows how to fix this, or another way to aggregate/summarize the mean, sum, and max by date, I would be very grateful. Thanks in advance!
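For context, this is the kind of fallback I had in mind as "another way": read the csv in chunks and combine per-chunk partial results (sums and maxima add up or take the max across chunks; the mean is recovered as total sum / total count). A minimal base-R sketch with a tiny self-made file; chunk size, file name, and the single ODBA.Sm column are placeholders:

```r
# create a tiny example csv so this sketch is self-contained
tmp <- tempfile(fileext = ".csv")
example <- data.frame(
  Date    = rep(c("2021-01-01", "2021-01-02"), each = 3),
  ODBA.Sm = c(1, 2, 3, 4, 5, 6)
)
write.csv(example, tmp, row.names = FALSE)

chunk_rows <- 2                      # real files would use e.g. 1e6
header <- names(read.csv(tmp, nrows = 1))
total  <- nrow(read.csv(tmp))        # fine for a sketch; the real files
                                     # would need a cheaper row count

partials <- list()
for (skip in seq(0, total - 1, by = chunk_rows)) {
  chunk <- read.csv(tmp, skip = skip + 1, nrows = chunk_rows,
                    header = FALSE, col.names = header)
  # per-chunk partial statistics: sum, count, max by Date
  p <- aggregate(ODBA.Sm ~ Date, data = chunk,
                 FUN = function(x) c(s = sum(x), n = length(x), mx = max(x)))
  partials[[length(partials) + 1]] <- data.frame(Date = p$Date, p$ODBA.Sm)
}

# combine partials: sums and counts add up, maxima take the max,
# and the mean is total sum / total count
all_p <- do.call(rbind, partials)
daily <- do.call(rbind, lapply(split(all_p, all_p$Date), function(d) {
  data.frame(Date      = d$Date[1],
             ODBA_sum  = sum(d$s),
             ODBA_mean = sum(d$s) / sum(d$n),
             ODBA_max  = max(d$mx))
}))
daily
```

I have not tried this on the full 9 GB files, so I do not know how it compares to ffdfdply in speed, but it sidesteps the split-size computation entirely.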