
Suppose I have the following data:

set.seed(1)
test <- data.frame(letters=rep(c("A","B","C","D"),10), numbers=sample(1:50, 40, replace=TRUE))

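I want to know how many of the numbers for letter A are not among the numbers for B, how many of B's numbers are not among C's, and so on.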

I came up with a solution using the base functions split and mapply:

s.test <-split(test, test$letters)
notIn <- mapply(function(x,y) sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers), x=names(s.test)[1:3], y=names(s.test)[2:4])

This gives:

> notIn
A B C 
9 7 7 
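
One detail worth keeping in mind about this count: `sum(!x %in% y)` counts every element of `x` that is missing from `y`, duplicates included, whereas `length(setdiff(x, y))` counts distinct values only. A minimal sketch with made-up vectors (`x` and `y` are illustrative, not taken from the data above):

```r
# Hypothetical toy vectors, chosen so that x contains a duplicate
x <- c(1, 1, 2, 3)
y <- c(3, 4)

# Counts each occurrence in x that has no match in y (1, 1 and 2)
n.all <- sum(!x %in% y)

# Counts distinct values of x that have no match in y (1 and 2)
n.distinct <- length(setdiff(x, y))

print(c(all = n.all, distinct = n.distinct))
```

Which of the two is wanted depends on whether repeated numbers within a letter should be counted more than once.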

But I would also like to do this with dplyr or data.table. Is that possible?


2 Answers


The bottleneck seems to be split. In a simulation with 200 groups of 150,000 observations each, split takes 50 of the 54 seconds in total.

The split step can be sped up considerably with data.table as follows:

## test is a data.table here
s.test <- test[, list(list(.SD)), by=letters]$V1
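
To make the shape of that result concrete, here is a small sketch on the question's toy data. It assumes the data.table package is installed; `s.base` and `same` are names introduced here purely for the cross-check against base R.

```r
library(data.table)

set.seed(1)
test <- data.table(letters = rep(c("A", "B", "C", "D"), 10),
                   numbers = sample(1:50, 40, replace = TRUE))

# One list element per group; each element is a data.table
# holding that group's remaining columns (here only 'numbers')
s.test <- test[, list(list(.SD)), by = letters]$V1

# Groups come back in order of first appearance, so name them accordingly
setattr(s.test, "names", unique(test$letters))

# Base R split of the numbers column, used only for comparison
s.base <- split(test$numbers, test$letters)

# Each group's numbers should match between the two approaches
same <- all(mapply(function(a, b) identical(a$numbers, b),
                   s.test[names(s.base)], s.base))
print(same)
```

Each element of `s.test` can then be indexed by group name, which is what the mapply call relies on.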

Here is a benchmark using data.table + mapply on data of your dimensions:

## generate data
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE), 
                 numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)

require(data.table)   ## latest CRAN version is v1.9.2
setDT(test)           ## convert to data.table by reference (no copy)
system.time({
    s.test <- test[, list(list(.SD)), by=letters]$V1 ## split
    setattr(s.test, 'names', unique(test$letters))   ## setnames
    notIn <- mapply(function(x,y) 
         sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers), 
              x=names(s.test)[1:199], y=names(s.test)[2:200])
})

##   user  system elapsed 
##  4.840   1.643   6.624 

That's roughly a 7.5x speedup on your largest data dimensions. Would this be sufficient?

answered 2014-03-23T02:56:16.353


This seems to give about the same speedup as data.table but uses only base R. Instead of splitting the whole data frame, it splits only the numbers column (the line marked ##):

## generate data - from Arun's post
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE), 
                 numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)

system.time({
    s.numbers <- with(test, split(numbers, letters)) ##
    notIn <- mapply(function(x,y) 
         sum(!s.numbers[[x]] %in% s.numbers[[y]]), 
              x=names(s.numbers)[1:199], y=names(s.numbers)[2:200])
})
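
As a sanity check, the following sketch runs both variants on the question's small data set and compares the results. `notIn.df` and `notIn.vec` are names introduced here for illustration.

```r
set.seed(1)
test <- data.frame(letters = rep(c("A", "B", "C", "D"), 10),
                   numbers = sample(1:50, 40, replace = TRUE))

# Approach from the question: split the whole data frame
s.test <- split(test, test$letters)
notIn.df <- mapply(function(x, y)
                     sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers),
                   x = names(s.test)[1:3], y = names(s.test)[2:4])

# Approach from this answer: split only the numbers column
s.numbers <- with(test, split(numbers, letters))
notIn.vec <- mapply(function(x, y)
                      sum(!s.numbers[[x]] %in% s.numbers[[y]]),
                    x = names(s.numbers)[1:3], y = names(s.numbers)[2:4])

# Both should yield the same named integer vector
print(identical(notIn.df, notIn.vec))
```

Splitting a single vector avoids building one data frame per group, which is presumably where most of the time in the original split goes.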
answered 2014-03-23T04:07:41.257