r - Subsetting a data.table by not head(key(DT),m), using binary search rather than vector scan

Question

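If I specify n columns as the key of a data.table, I'm aware that I can join to fewer columns than are defined in that key, as long as I join to the head of key(DT). For example, for n=2: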

X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
    A B
 1: 1 1
 2: 1 1
 3: 2 1
 4: 2 1
 5: 3 1
 6: 3 2
 7: 4 2
 8: 4 2
 9: 5 2
10: 5 2

X[J(3)]
   A B
1: 3 1
2: 3 2

There, I joined only to the first column of DT's 2-column key. I know I can join to both columns of the key like this:

X[J(3,1)]
   A B
1: 3 1

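But how do I subset using only the second column of the key (e.g. B==2), while still using binary search rather than a vector scan? I'm aware that this is a duplicate of:

Subsetting a data.table by 2nd column only of a 2 column key, using binary search not vector scan

So I'd like to generalise that question to n. My data set has about a million rows, and the solution provided in the duplicate question linked above doesn't seem to be optimal.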

score 5 · Accepted Answer

Here is a simple function that will extract the correct unique values and return a data table to use as a key.

 library(data.table)
 X <- data.table(A=rep(1:5, each=4), B=rep(1:4, each=5),
                 C = letters[1:20], key=c('A','B','C'))
 make.key <- function(ddd, what){
   # the names of the key columns
   zzz <- key(ddd)
   # the key columns for which you wish to keep all unique values
   whichUnique <- setdiff(zzz, names(what))
   ## unique values of each of those columns;
   ## .. means "look up one level", i.e. use the variable whichUnique
   ud <- lapply(ddd[, ..whichUnique], unique)
   ## append the `what` columns, reorder to match the key, and
   ## build a Cross Join of the new key columns
   do.call(CJ, c(ud,what)[zzz])
 }

X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b

I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
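
One way to see why the speed-up is uncertain is to look at the size of the table the join has to match against. The sketch below repeats the steps of `make.key` by hand on the same `X`; the names `free`, `lookup` and `res` are only illustrative and are not part of the function above.

```r
library(data.table)

X <- data.table(A = rep(1:5, each = 4), B = rep(1:4, each = 5),
                C = letters[1:20], key = c("A", "B", "C"))

# Unique values of the key columns we are NOT filtering on
free <- lapply(X[, c("A", "B"), with = FALSE], unique)

# Cross join of those values with the requested C values, in key order
lookup <- do.call(CJ, c(free, list(C = c("a", "b")))[key(X)])

# Number of candidate key combinations: 5 values of A * 4 of B * 2 of C
nrow(lookup)

# Binary-search join against the full key; unmatched combinations are dropped
res <- X[lookup, nomatch = 0]
res
```

The lookup table grows with the product of the unique counts of every key column that is left free, so on a wide key with many distinct values it can end up larger than `X` itself. Timing both approaches with `system.time()` on the real data is probably the only reliable way to decide.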

score 2 · Accepted Answer

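Adding secondary keys is on the feature request list:

FR#1007 Build in secondary keys

In the meantime we're stuck with either a vector scan, or the approach used in the answer to the n=2 case linked in the question (which @mnel nicely generalises in his answer).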
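
For readers on a later data.table release: the sketch below assumes a version that provides `setindex()` and the `on=` argument, neither of which was available when this answer was written. Under that assumption, a secondary index on `B` allows a subset on the second key column without rebuilding the primary key. It uses the `X` from the question; `res` is just an illustrative name.

```r
library(data.table)

X <- data.table(A = rep(1:5, each = 2), B = rep(1:2, each = 5),
                key = c("A", "B"))

# Build a secondary index on B; the primary key (A, B) is left untouched
setindex(X, B)

# Subset on B alone; on = "B" requests a join on that column only
res <- X[.(2L), on = "B"]
res

# The primary key is still in place for ordinary keyed joins
key(X)
```

Whether this beats a vector scan on a million rows is again something to measure rather than assume.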
