R - ff package: find the most frequent element in an ffdf and remove the rows that contain it

Question

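I need advice on how to find the most frequent element in an ffdf and then remove the rows in which it occurs. I decided to try the ff package because I am working with very large data and run out of memory in base R.

Here is a small example: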

 # create a base R Matrix

 > z<-matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),nrow=5,ncol=2,byrow = TRUE)
 > z


     [,1] [,2]
 [1,] "a"  "b" 
 [2,] "a"  "c" 
 [3,] "b"  "b" 
 [4,] "c"  "c" 
 [5,] "b"  "a" 


 # convert z to ffdf

 > u=as.data.frame(z, stringsAsFactors=TRUE)
 > u=as.ffdf(u)
 > u

  ffdf data
   V1 V2
1  a  b
2  a  c
3  b  b
4  c  c
5  b  a

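I am looking to:

find the most frequent element in the ffdf (in this example, "b")
remove from the ffdf all rows in which "b" occurs

So the new ffdf should look like this: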

   V1 V2
1  a  c
2  c  c

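In base R I found a way to do this with the "table" function:

  # count every element of the matrix, then take the name(s) with the highest count
  temp <- table(as.vector(z))
  t1 <- names(temp)[temp == max(temp)]
  # keep rows without the first most frequent element; drop = FALSE keeps a matrix even if one row is left
  z1 <- z[rowSums(z == t1[1]) == 0, , drop = FALSE]

But to handle large data I need something like the ff package.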
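For illustration only, the same two passes can be written so that only a few rows are inspected at a time, which is the general idea behind out-of-memory tools. This is a minimal base R sketch on the toy matrix from the question, not the ff API; `chunk_size` and `add_counts` are hypothetical names introduced here.

```r
# Toy data from the question; with real data each chunk would come from disk.
z <- matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),
            nrow = 5, ncol = 2, byrow = TRUE)

chunk_size <- 2  # hypothetical value, kept tiny for illustration
starts <- seq(1, nrow(z), by = chunk_size)

# Hypothetical helper: merge the counts of one chunk into the running totals.
add_counts <- function(counts, tab) {
  vals <- setNames(as.integer(tab), names(tab))
  keys <- union(names(counts), names(vals))
  out <- setNames(integer(length(keys)), keys)
  out[names(counts)] <- counts
  out[names(vals)] <- out[names(vals)] + vals
  out
}

# Pass 1: accumulate element counts chunk by chunk.
counts <- integer(0)
for (s in starts) {
  rows <- s:min(s + chunk_size - 1, nrow(z))
  counts <- add_counts(counts, table(as.vector(z[rows, , drop = FALSE])))
}
most_frequent <- names(counts)[which.max(counts)]

# Pass 2: keep the rows that do not contain the most frequent element.
keep <- rowSums(z == most_frequent) == 0
z1 <- z[keep, , drop = FALSE]
```

On this matrix `most_frequent` is "b" and `z1` holds the original rows 2 and 4. If several elements tie, `which.max` picks the first one, which matches the `t1[1]` choice above.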

score 1 · Accepted Answer

require(ff)
z <- matrix(c("a","b","f","c","f","b","e","c","b","e"),nrow=5,ncol=2,byrow = TRUE)
u <- as.data.frame(z, stringsAsFactors=TRUE)
u <- as.ffdf(u)
u

The following should work for datasets of any size. It uses table.ff and ffwhich from ffbase, ffrowapply from ff, and indexing based on an ff integer vector.

require(ffbase)
require(plyr)
## Detect most frequent item (assuming the levels of all columns can be different)
columnfreqs <- lapply(colnames(u), FUN=function(column) table(u[[column]]))
columnfreqs <- lapply(columnfreqs, FUN=function(x) as.data.frame(t(as.matrix(x))))
itemfreqs <- colSums(do.call(rbind.fill, columnfreqs), na.rm=TRUE)
mostfrequent <- names(sort(itemfreqs, decreasing = TRUE))[1]

## Identify the lines where the most frequent item occurs in each row of the ffdf 
idx <- ffrowapply(
  EXPR = apply(u[i1:i2,], MARGIN=1, FUN=function(row) any(row %in% mostfrequent)), 
  X=u, 
  RETURN = TRUE, FF_RETURN = TRUE, RETCOL = NULL, VMODE = "logical")
idx <- ffwhich(idx, idx == FALSE) # keep the rows where the item does not occur + convert logicals to integers

## Select the remaining rows (the rows containing the most frequent item are dropped)
u[idx, ]
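On a small sample that fits in memory, the result can be cross-checked in base R. The sketch below is an assumption-labelled illustration, not part of the answer's ff code: it repeats the per-column counting (each column may have different factor levels) and the row filter on an ordinary data frame, with `u_df`, `keys`, `vals` and `keep` as hypothetical names.

```r
# Same toy data as in the answer, kept as an ordinary in-memory data frame.
z <- matrix(c("a", "b", "f", "c", "f", "b", "e", "c", "b", "e"),
            nrow = 5, ncol = 2, byrow = TRUE)
u_df <- as.data.frame(z, stringsAsFactors = TRUE)

# Count items per column; the columns do not share the same factor levels.
keys <- unlist(lapply(u_df, function(col) names(table(col))), use.names = FALSE)
vals <- unlist(lapply(u_df, function(col) as.integer(table(col))), use.names = FALSE)

# Sum the counts that belong to the same item across columns.
itemfreqs <- tapply(vals, keys, sum)
mostfrequent <- names(itemfreqs)[which.max(itemfreqs)]

# Flag rows that contain the most frequent item, then keep the others.
keep <- !apply(u_df, 1, function(row) any(row %in% mostfrequent))
result <- u_df[keep, , drop = FALSE]
```

For this data `mostfrequent` is "b" and `result` contains the original rows 2 and 4, which is what the ff version should select as well.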
