r - 参考查找后将 data.table 与其自身合并

Question

如果我有data.tables DTand neighbors：

set.seed(1)
library(data.table)
DT <- data.table(idx=rep(1:10, each=5), x=rnorm(50), y=letters[1:5], ok=rbinom(50, 1, 0.90))
n <- data.table(y=letters[1:5], y1=letters[c(2:5,1)])

n是一个查找表。无论何时ok == 0，我都想查找相应y1的 inn并将该值用于x给定的值idx。举个例子，DT的第4行：

> DT
   idx          x y ok
1:   1 -0.6264538 a  1
2:   1  0.1836433 b  1
3:   1 -0.8356286 c  1
4:   1  1.5952808 d  0
5:   1  0.3295078 e  1
6:   2 -0.8204684 a  1

y1来自nford是e：_

> n[y == 'd']
   y y1
1: d  e

第 4行idx是 1。所以我会使用：

> DT[idx == 1 & y == 'e', x]
[1] 0.3295078

我希望我的输出与所有值都替换为相应的 n['y1']值data.table一样：DT[ok == 0]xx

> output
   idx          x y ok
1:   1  0.3295078 d  0
2:   2 -0.3053884 d  0
3:   3  0.3898432 a  0
4:   5  0.7821363 a  0
5:   7  1.3586800 e  0
6:   8  0.7631757 d  0

我可以想出几种使用base R或plyr...的方法来做到这一点，也许周五晚些时候...但是无论这需要什么合并序列data.table都超出了我的范围！

score 8 · Accepted Answer

好问题。使用其他答案中的函数并将 Blue 的答案包装到一个函数blue中，以下如何。基准包括setkey所有情况下的时间。

red = function() {
    ans = DT[ok==0]
      # Faster than setkey(DT,ok)[J(0)] if the vector scan is just once
      # If lots of lookups to "ok" need to be done, then setkey may be worth it
      # If DT[,ok:=as.integer(ok)] can be done first, then ok==0L slightly faster

    # After extracting ans in the original order of DT, we can now set the key :
    setkey(DT,idx,y)
    setkey(n,y)

    # Now working with the reduced ans ...

    ans[,y1:=n[y,y1,mult="first"]]
    # Add a new column y1 by reference containing the lookup in n
    # mult="first" because we know n's key is unique, for speed (to save looking
    # for groups of matches in n). Future version of data.table won't need this.
    # Also, mult="first" has the advantage of dropping group columns (so we don't
    # need [[2L]]). mult="first"|"last" turns off by-without-by of mult="all".

    ans[,x:=DT[ans[,list(idx,y1)],x,mult="first"]]
    # Changes the contents of ans$x by reference. The ans[,list(idx,y1)] part is
    # how to pick the columns of ans to join to DT's key when they are not the key
    # columns of ans and not the first 1:n columns of ans. There is no need to key
    # ans, especially since that would change ans's order and not strictly answer
    # the question. If idx and y1 were columns 1 and 2 of (unkeyed) ans then we
    # wouldn't need that part, just
    #    ans[,x:=DT[ans,x,mult="first"]]
    # would do (relying on DT having 2 columns in its key). That has the advantage
    # of not copying the idx and y1 columns into a new data.table to pass as the i
    # DT. To save that copy y1 could be moved to column 2 using setcolorder first.

    redans <<- ans
    }

crdt(1e5)
origDT = copy(DT)
benchmark(blue={DT=copy(origDT); system.time(blue())},
          red={DT=copy(origDT); system.time(red())},
          fun={DT=copy(origDT); system.time(fun(DT,n))},
          replications=3, order="relative")

test replications elapsed relative user.self sys.self user.child sys.child
 red            3   1.107    1.000     1.100    0.004          0         0
blue            3   5.797    5.237     5.660    0.120          0         0
 fun            3   8.255    7.457     8.041    0.184          0         0

crdt(1e6)
[ .. snip .. ]
test replications elapsed relative user.self sys.self user.child sys.child
 red            3  14.647    1.000    14.613    0.000          0         0
blue            3  87.589    5.980    87.197    0.124          0         0
 fun            3 197.243   13.466   195.240    0.644          0         0

identical(blueans[,list(idx,x,y,ok,y1)],redans[order(idx,y1)])
# [1] TRUE

中order需要，identical因为在 tie 的情况下，red返回结果的顺序与DT[ok==0]whileblue似乎是 order by相同。y1idx

如果y1结果中不需要它，可以使用ans[,y1:=NULL];立即将其删除（无论表格大小）即，这可以包含在上面以产生所要求的确切结果，而完全不影响时间。

score 4 · Accepted Answer

library(data.table)

crdt <- function(i=10){
 set.seed(1)
 DT <<- data.table(idx=rep(1:i, each=5), x=rnorm(5*i), 
                   y=letters[1:5], ok=rbinom(5*i, 1, 0.90))
 n <<- data.table(y=letters[1:5], y1=letters[c(2:5,1)])
} 

fun <- function(DT,n){
 setkey(DT,ok)
 n1 <- merge(n,DT[J(0),list(y,idx)],by="y")
 DT[J(0),x:=DT[paste0(y,idx) %in% paste0(n1[,y1],n1[,idx]),x]]
} 

crdt(10)
fun(DT,n)[J(0)]
     ok idx          x y
[1,]  0   1  0.3295078 d
[2,]  0   2 -0.3053884 d
[3,]  0   3  0.3898432 a
[4,]  0   5  0.7821363 a
[5,]  0   7  1.3586796 e
[6,]  0   8  0.7631757 d

但是对于更大的 data.tables，它仍然很慢：

crdt(1e6)
system.time(fun(DT,n)[J(0)])
       User      System     elapsed 
      4.213       0.162       4.374 

crdt(1e7)
system.time(fun(DT,n)[J(0)])
       User      System     elapsed 
    195.685       3.949     199.592

我有兴趣学习更快的解决方案。

score 4 · Accepted Answer

超级复杂的答案：

setkey(
    setkey(
        setkey(DT,y)[setkey(n,y),nomatch=0] #inner joins DT to n
    #matches the new x value by idx and y, and assigns it
    ,idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x] 
,ok)[list(0)] #pulls things where ok == 0

看起来 Roland 的答案更适合较小的桌子，但我的最终会赶上更大的尺寸。不过，我还没有做很多检查。

> library(rbenchmark)
> benchmark(fun(DT,n)[J(0)],setkey(setkey(setkey(DT,y)[setkey(n,y),nomatch=0],idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x],ok)[list(0)])
                                                                                                                                  test
1                                                                                                                     fun(DT, n)[J(0)]
2 setkey(setkey(setkey(DT, y)[setkey(n, y), nomatch = 0], idx, y1)[setkey(J(idx, y, new.x = x), idx, y), `:=`(x, new.x)], ok)[list(0)]
  replications elapsed relative user.self sys.self user.child sys.child
1          100   13.21 1.000000     13.08     0.02         NA        NA
2          100   15.08 1.141559     14.76     0.06         NA        NA
> crdt(1e5)
> benchmark(fun(DT,n)[J(0)],setkey(setkey(setkey(DT,y)[setkey(n,y),nomatch=0],idx,y1)[setkey(J(idx,y,new.x=x),idx,y),x:=new.x],ok)[list(0)])
                                                                                                                                  test
1                                                                                                                     fun(DT, n)[J(0)]
2 setkey(setkey(setkey(DT, y)[setkey(n, y), nomatch = 0], idx, y1)[setkey(J(idx, y, new.x = x), idx, y), `:=`(x, new.x)], ok)[list(0)]
  replications elapsed relative user.self sys.self user.child sys.child
1          100  150.49 1.000000    148.98     0.89         NA        NA
2          100  155.33 1.032162    151.04     2.25         NA        NA
>

r - 参考查找后将 data.table 与其自身合并

3 回答 3

Related

Reference