r - Computations on a data frame from an ffdf object

Question

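I'm working with a large dataset (3.5 million rows and 40 columns), and I need to clean out some values so that I can compute other parameters I'll need once I start building models on the data.

The problem is that the for loop I've been using takes a very long time, so I wanted to try the ff package. The data frame is called data and consists of a bunch of customer information from a bank. It was imported from a .csv file. What I need to do is remove every customer (identified by Serial) whose AverageStanding variable is negative.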

> ffd<-as.ffdf(data)
> lastserial = tail(ffd$Serial,1)
> for(k in 1:lastserial){
+   tempvecWith <- vector()
+   tempvecWith <- ffd[ffd$Serial==k, ]$AverageStanding
+   if(any(tempvecWith < 0)){
+     ffd_clean<- ffd[!ffd$Serial ==k, ]
+   }
+ }

This is the error I get:

Error in as.hi.integer(x, maxindex = maxindex, dim = dim, vw = vw, pack = pack) : 
NAs in as.hi.integer

Any ideas on how to avoid this error?

score 1 · Accepted Answer

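The error comes from this part of your code: ffd[ffd$Serial==k, ]. Specifically, ffd$Serial==k returns an ff logical vector. However, to index or subset an ff vector or an ffdf, you need to supply index numbers, not a logical vector. You can use ffwhich from the ffbase package to convert an ff logical vector into an ff vector of index numbers.

So for your problem, I believe you are looking for code along these lines (untested, since you didn't provide any data).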
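To make the logical-vector versus index-number distinction concrete, here is a minimal in-memory sketch using base R's which(), which plays a role analogous to ffwhich for ordinary vectors. The serial values are made up for illustration, and the NA entry is an assumption added only to show how missing values behave in each form.

```r
# Hypothetical serial numbers, with one missing value
serial <- c(1, 1, 2, NA, 3, 3)

# Logical mask: TRUE/FALSE per element, NA where serial is missing
mask <- serial == 3

# Integer positions of the TRUE entries only; which() drops the NA
pos <- which(mask)

print(mask)
print(pos)
```

The mask carries the NA along, while the position vector does not. Given that the error message mentions NAs, it may be worth checking whether Serial contains missing values.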

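library(ffbase)
# Row positions where AverageStanding is negative
idx <- ffd$AverageStanding < 0
idx <- ffwhich(idx, idx==TRUE)
open(ffd)
# Unique serials of customers with at least one negative value
serials.with.negative <- ffd$Serial[idx]
serials.with.negative <- unique(serials.with.negative)
# Flag every row that belongs to one of those customers
ffd$is.customer.with.negative.avgstanding <- ffd$Serial %in% serials.with.negative

# Keep only the rows of customers that were not flagged
idx <- ffd$is.customer.with.negative.avgstanding == FALSE
idx <- ffwhich(idx, idx==TRUE)
open(ffd)
ffd_clean <- ffd[idx, ]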
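The same logic can be checked on a small in-memory data frame before running it against the full ffdf. The sketch below uses plain base R and made-up values; only the column names Serial and AverageStanding come from the question. It also shows why no loop is needed. In the original loop, ffd_clean is rebuilt from ffd on every matching iteration, so at best only the last offending customer would end up removed.

```r
# Toy stand-in for the bank data (hypothetical values)
data <- data.frame(
  Serial          = c(1, 1, 2, 2, 3, 4, 4),
  AverageStanding = c(100, 50, -20, 300, 10, 5, -1)
)

# Rows with a negative AverageStanding; which() ignores NA comparisons
neg_rows <- which(data$AverageStanding < 0)

# Serials of customers that have at least one negative row
serials_with_negative <- unique(data$Serial[neg_rows])

# Drop every row belonging to those customers in one vectorized step
data_clean <- data[!(data$Serial %in% serials_with_negative), ]

print(data_clean)
```

Customers 2 and 4 each have one negative row, so all of their rows are dropped, including the non-negative ones.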
