r - 仅通过 2 列键的第 2 列对 data.table 进行子集，使用二进制搜索而不是矢量扫描

Question

我最近在data.table. 如果表格按多个键排序，是否只能搜索第二个键？

DT = data.table(x=sample(letters,1e7,T),y=sample(1:25,1e7,T),rnorm(1e7))
setkey(DT,x,y)
#R> DT[J('x')]
#        x  y       V3
#     1: x  1  0.89109
#     2: x  1 -2.01457
#    ---              
#384922: x 25  0.09676
#384923: x 25  0.25168
#R> DT[J('x',3)]
#       x y       V3
#    1: x 3 -0.88165
#    2: x 3  1.51028
#   ---             
#15383: x 3 -1.62218
#15384: x 3 -0.63601

编辑：感谢@Arun

R> system.time(DT[J(unique(x), 25)])
   user  system elapsed 
  0.220   0.068   0.288 
R> system.time(DT[y==25])
   user  system elapsed 
  0.268   0.092   0.359

score 18 · Accepted Answer

Yes, you can pass all values to the first key value and subset with the specific value for the second key.

DT[J(unique(x), 25), nomatch=0]

If you need to subset by more than one value in the second key (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ

DT[CJ(unique(x), 25:24), nomatch=0]

Note that CJ by default sorts the columns and sets key to all the columns, which means the result would be sorted as well. If that's not desirable, you should use sorted=FALSE

DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]

There's also a feature request to add secondary keys to data.table in future. I believe the plan is to add a new function set2key.

FR#1007 Build in secondary keys

There is also merge, which has a method for data.table. It builds the secondary key inside it for you, so should be faster than base merge. See ?merge.data.table.

score 6 · Accepted Answer

基于这个电子邮件线程，我编写了以下函数：

create_index = function(dt, ..., verbose = getOption("datatable.verbose")) {
  cols = data.table:::getdots()
  res = dt[, cols, with=FALSE]
  res[, i:=1:nrow(dt)]
  setkeyv(res, cols, verbose = verbose)
}

JI = function(index, ...) {
  index[J(...),i]$i
}

以下是我的系统上的结果，其 DT 较大（1e8 行）：

> system.time(DT[J("c")])
   user  system elapsed 
  0.168   0.136   0.306 

> system.time(DT[J(unique(x), 25)])
   user  system elapsed 
  2.472   1.508   3.980 
> system.time(DT[y==25])
   user  system elapsed 
  4.532   2.149   6.674 

> system.time(IDX_y <- create_index(DT, y))
   user  system elapsed 
  3.076   2.428   5.503 
> system.time(DT[JI(IDX_y, 25)])
   user  system elapsed 
  0.512   0.320   0.831

如果您多次使用索引，这是值得的。

r - 仅通过 2 列键的第 2 列对 data.table 进行子集，使用二进制搜索而不是矢量扫描

2 回答 2

Related

Reference