r - Find all matches in a vector within a data.table

Question

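I have a vector of ids, sampleIDs. I also have a data.table, rec_data_table, keyed by bID and containing a column, A_IDs.list, where each element is a collection (vector) of aIDs.

I would like to create a second data.table that contains sampleIDs and where,
for each aID, there is a corresponding vector of all the bIDs
in which that aID appears in the A_IDs.list column.

Example: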

> rec_data_table
   bid counts names_list A_IDs.list
1: 301     21        C,E       3,NA
2: 302     21          E         NA
3: 303      5      H,E,G     8,NA,7
4: 304     10        H,D        8,4
5: 305      3          E         NA
6: 306      5          G          7
7: 307      6        B,C        2,3

> sampleIDs
[1] 3 4 8

AB.dt <- data.table(aID=sampleIDs, key="aID")

# unknown step
AB.dt[ , bIDs := ????  ]

# desired result:
> AB.dt
    aid     bIDs
1:    3  301,307
2:    4      304
3:    8  303,304

I have tried several different lines inside the AB.dt[ ] call. The closest I can get is

rec_data_table[sapply(A_IDs.list, function(lst) aID %in% lst), bID]

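This gives me the desired result for a given aID, and I could repeat it over sampleIDs to build a list of vectors and assemble the desired result.

However, I suspect there must be a more "data.table-appropriate" way to accomplish this. Any suggestions are appreciated.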

#--------------------------------------------------#
#           SAMPLE DATA                            #

library(data.table)
set.seed(101)

  rows <- size <- 7
  varyingLengths <- c(sample(1:3, rows, TRUE))
  A <-  lapply(varyingLengths, function(n) sample(LETTERS[1:8], n))
  counts <- round(abs(rnorm(size)*12))   
rec_data_table <- data.table(bID=300+(1:size), counts=counts, names_list=A, key="bID")

A_ids.DT <- data.table(name=LETTERS[c(1:4,6:8,10:11)], id=c(1:4,6:8,10:11), key="name")
rec_data_table[, A_IDs.list := lapply(names_list, function(n) A_ids.DT[n]$id)]
sampleIDs <- c(3, 4, 8)

score 2 · Accepted Answer

After joining tmp to A_ids.DT in my answer to your previous question, you can get your desired output by looking up sampleIDs in tmp:

# ... from previous answer
# tmp <- A_ids.DT[tmp]

AB.dt <- setkey(tmp, id)[J(sampleIDs)][, list(bIDs = list(bID)),
                                       by = list(aid = id)]

# setkey(tmp, orig.order)
# previous answer continues ...

Note that the capitalization of your bID column is different in these two questions, however. This is assuming, of course, that you are not executing the second to last line in your sample data. This ought to be faster than %in%-based approaches when there are many records due to the wonders of data.table's binary search.
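
Since `tmp` comes from the earlier answer and is not shown here, the following is only a minimal self-contained sketch of the same keyed lookup. It assumes `tmp` is a long table with one row per (bID, id) pair; the literal values are a hand-built stand-in based on the example in the question, not the real `tmp`.

```r
library(data.table)

# Hypothetical stand-in for `tmp`: one row per (bID, id) pair, NAs already dropped.
tmp <- data.table(
  bID = c(301L, 303L, 303L, 304L, 304L, 306L, 307L, 307L),
  id  = c(3L,   8L,   7L,   8L,   4L,   7L,   2L,   3L)
)
sampleIDs <- c(3L, 4L, 8L)

# Keying sorts the table by id, so each lookup can use binary search.
setkey(tmp, id)

# Keyed join on sampleIDs, then collect the matching bIDs for each id.
AB.dt <- tmp[J(sampleIDs)][, list(bIDs = list(bID)), by = list(aID = id)]
print(AB.dt)
```

The join touches only the rows whose id is in sampleIDs, rather than testing every list element with %in%.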

score 1 · Accepted Answer

I think this gives the output you want:

myfun <- function(ids) {
  any(ids %in% sampleIDs)
}

rec_data_table[sapply(A_IDs.list, myfun),]

#    bID counts names_list A_IDs.list
# 1: 301     21        C,E       3,NA
# 2: 303      5      H,E,G     8,NA,7
# 3: 304     10        H,D        8,4
# 4: 307      6        B,C        2,3

rec_data_table[sapply(A_IDs.list, myfun), list(bID, A_IDs.list)]

#    bID A_IDs.list
# 1: 301       3,NA
# 2: 303     8,NA,7
# 3: 304        8,4
# 4: 307        2,3

You can use unlist on the A_IDs.list column to get a long data.table:

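# Unlist within each bID so every id stays paired with its own row;
# unlisting without `by` recycles bID and misaligns the two columns.
na.omit(rec_data_table[sapply(A_IDs.list, myfun), list(aID = unlist(A_IDs.list)), by = bID])

#    bID aID
# 1: 301   3
# 2: 303   8
# 3: 303   7
# 4: 304   8
# 5: 304   4
# 6: 307   2
# 7: 307   3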

I would suggest working with "long" data rather than the nested list construct above, since it usually leads to simpler code.
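
As a rough sketch of what the long layout buys you, the snippet below rebuilds the table from the question by hand (the literal values and the name `long` are illustrative assumptions) and then answers the original question with a filter and a grouped aggregation.

```r
library(data.table)

# Hand-built stand-in for the nested table shown in the question.
rec_data_table <- data.table(
  bID        = 301:307,
  A_IDs.list = list(c(3L, NA), NA_integer_, c(8L, NA, 7L), c(8L, 4L),
                    NA_integer_, 7L, c(2L, 3L))
)
sampleIDs <- c(3L, 4L, 8L)

# One row per (bID, aID) pair; grouping by bID keeps each id with its own row.
long <- rec_data_table[, list(aID = unlist(A_IDs.list)), by = bID]
long <- long[!is.na(aID)]

# With long data the lookup is a plain filter plus a grouped aggregation.
AB.dt <- long[aID %in% sampleIDs, list(bIDs = list(bID)), keyby = aID]
print(AB.dt)
```

The nested bIDs column is only built at the very end, for display; every intermediate step works on ordinary atomic columns.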

score 0 · Accepted Answer

# For each sample id, collect the bIDs whose A_IDs.list contains it.
bIDs <- lapply(sampleIDs, function(x) {
  rec_data_table$bID[sapply(rec_data_table$A_IDs.list, function(y) x %in% y)]
})
AB.dt <- data.table(aID = sampleIDs, bIDs = bIDs)

Maybe there is a faster way, but this one works. :)
