
In R, I'm looking for a memory-efficient way to create a summary of tabular data as follows.

Take, for example, the data.frame foo, which I summarize with table() and then pass to as.data.frame() to obtain the frequency counts.

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'), y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- as.data.frame(table(foo), stringsAsFactors = FALSE)

This results in the following frequency counts in bar:

   x  y Freq
1  a ab    1
2  b ab    0
3  a ac    1
4  b ac    0
5  a ad    1
6  b ad    0
7  a ae    0
8  b ae    1
9  a fx    0
10 b fx    1
11 a fy    0
12 b fy    1

The problem is that when x and y have many levels, table() produces a row for every combination of levels, including those that never occur, and memory use becomes significant (more than 64 GB). Is there an alternative way of doing this kind of frequency count? As a first step I set stringsAsFactors = FALSE, but this doesn't completely solve the problem.


3 Answers


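I have this quick approach to fast (sparse) cross-tabulation. I think there is room to optimise it further, but it has been sufficient for large datasets for me. The key is using ninteraction from the plyr package to quickly generate a numeric id for each row.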

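tab <- function(df, drop = TRUE) {
  # One numeric id per row; identical rows share the same id
  id <- plyr::ninteraction(df)
  ord <- order(id)

  # Sort rows and ids so that identical rows are adjacent
  df <- df[ord, , drop = FALSE]
  id <- id[ord]

  # Run lengths of the sorted ids are the frequency counts
  freq <- rle(id)$lengths
  # Keep the last row of each run as its label and reset the row names
  labels <- df[cumsum(freq), , drop = FALSE]
  rownames(labels) <- NULL

  data.frame(labels, freq)
}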
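
The same sort-and-run-length idea can be sketched without plyr. This is only an illustration, not the function above: it builds a pasted string key instead of a numeric id, which is likely slower and heavier on memory than ninteraction, and it assumes no value contains the separator character "\037". The data frame foo_dup is a hypothetical variant of foo with one repeated row, so that at least one count exceeds 1.

```r
# Hypothetical example data: foo from the question plus a repeated (b, fy) row
foo_dup <- data.frame(
  x = c("a", "a", "a", "b", "b", "b", "b"),
  y = c("ab", "ac", "ad", "ae", "fx", "fy", "fy"),
  stringsAsFactors = FALSE
)

# One string key per row (assumes "\037" never occurs in the data)
key <- do.call(paste, c(foo_dup, sep = "\037"))

# Sort so identical rows are adjacent, then count run lengths
ord <- order(key)
runs <- rle(key[ord])

# The last row of each run serves as the label for that combination
last <- cumsum(runs$lengths)
res <- data.frame(foo_dup[ord, , drop = FALSE][last, , drop = FALSE],
                  freq = runs$lengths)
rownames(res) <- NULL

print(res)
```

Only combinations that actually occur appear in res, so the result has at most nrow(foo_dup) rows rather than one row per pair of levels.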
answered 2010-04-26T18:32:16.107
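library(plyr)
# .drop = FALSE keeps combinations that never occur (with a count of 0);
# the default, .drop = TRUE, returns only the observed combinations
ddply(foo, ~ x + y, nrow, .drop = FALSE)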
answered 2010-04-26T16:25:30.213

Check out the xtabs method in the Matrix package, which does sparse cross-tabulation.
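
A minimal sketch of what that might look like, assuming a recent R in which stats::xtabs() accepts sparse = TRUE and the Matrix package is installed (the name bar_sparse is made up for this example). The sparse form handles exactly two classifying variables, which matches the x and y case in the question.

```r
library(Matrix)  # provides the sparse matrix classes and their summary() method

foo <- data.frame(
  x = c("a", "a", "a", "b", "b", "b"),
  y = c("ab", "ac", "ad", "ae", "fx", "fy"),
  stringsAsFactors = TRUE
)

# Sparse contingency table: only the non-zero cells are stored
sp <- xtabs(~ x + y, data = foo, sparse = TRUE)

# summary() on a sparse matrix lists the stored entries as (i, j, x) triplets
nz <- summary(sp)

# Map the indices back to level names to get the long x / y / Freq layout
bar_sparse <- data.frame(
  x = rownames(sp)[nz$i],
  y = colnames(sp)[nz$j],
  Freq = nz$x
)

print(bar_sparse)
```

Unlike as.data.frame(table(foo)), bar_sparse has no rows for combinations with a count of zero.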

answered 2010-04-26T16:06:34.963