r - Vectorized version / vectorizing an equality for loop in R

Question

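I have a vector of values, called X, and a data frame, called dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] that match each element of X.

Below is my very inefficient for loop. Note that X has many observations, and each member of "match.ind" can have zero or more matches. In addition, dat.fram has over 1 million observations. Is there a way to make this process more efficient using a vectorized function in R?

Ultimately, I need a list, because I will pass it to another function that will retrieve the appropriate values from dat.fram.

Code: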

match.ind=list()

for(i in 1:150000){
    match.ind[[i]]=which(dat.fram[,3]==X[i])
}

score 1 · Accepted Answer

UPDATE:

Ok, wow, I just found an awesome way of doing this... it's really slick. I wonder whether it might be useful in other contexts...?!

### define v as a sample column of data - you should define v to be 
### the column in the data frame you mentioned (dat.fram[,3]) 

v = sample(1:150000, 1500000, replace=TRUE)

### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the rownames of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points

mybiglist = tapply(seq_along(v),v,c)

### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to

X = sample(1:200000, 150000)
mylist = mybiglist[which(names(mybiglist)%in%X)]

That's it! As a check, let's look at the first 3 entries of mylist:

> mylist[1:3]

$`1`
[1]  401143  494448  703954  757808 1364904 1485811

$`2`
[1]  230769  332970  389601  582724  804046  997184 1080412 1169588 1310105

$`4`
[1]  149021  282361  289661  456147  774672  944760  969734 1043875 1226377

There is a gap at 3 because 3 does not appear in X (even though it does appear in v). The numbers listed against 4 are the index positions in v where 4 appears:

> which(X==3)
integer(0)

> which(v==3)
[1]  102194  424873  468660  593570  713547  769309  786156  828021  870796  
883932 1036943 1246745 1381907 1437148

> which(v==4)
[1]  149021  282361  289661  456147  774672  944760  969734 1043875 1226377
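
The same check can be run over every entry rather than by eye. The following is only a sketch on smaller, seeded stand-ins for v and X (the seed and the sizes are my own assumptions, chosen so it runs instantly); it compares each entry of mylist with the result of which() for the value in its name:

```r
# Smaller, seeded stand-ins for v and X (seed and sizes are illustrative assumptions)
set.seed(1)
v <- sample(1:50, 500, replace = TRUE)
X <- sample(1:80, 30)

# Same construction as above: indices of v grouped by value, then restricted to X
mybiglist <- tapply(seq_along(v), v, c)
mylist <- mybiglist[names(mybiglist) %in% X]

# Each entry should equal which(v == value), where value is the entry's name
ok <- vapply(seq_along(mylist), function(i) {
  value <- as.integer(names(mylist)[i])
  identical(mylist[[i]], which(v == value))
}, logical(1))

all(ok)
```

If all(ok) is TRUE, every entry agrees with what the original loop would have produced for that value.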

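Finally, note that values which appear in X but not in v will have no entry in the list, but that is presumably what you want anyway, since they would be NULL!

Extra note: you can use the code below to create an NA entry for each member of X that is not in v...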

blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]

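Fairly self-explanatory: mylist_extras is a list holding all the extra entries you need (the names are the values of X that are not in names(mylist), and the actual entries in the list are simply NA). The final two lines first merge mylist and mylist_extras, then reorder the result so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.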
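
If the list has to line up with X element by element, as match.ind does in the question, one possible alternative is to group the indices with split() and look them up with match(). This is a sketch rather than the method above: it swaps tapply for split, uses small seeded stand-in data of my own, fills the gaps with integer(0) (what which() returns) instead of NA, and assumes the values are integers so that their character form matches the list names:

```r
# Small, seeded stand-ins (seed and sizes are illustrative assumptions)
set.seed(42)
v <- sample(1:20, 200, replace = TRUE)   # stand-in for dat.fram[, 3]
X <- sample(1:30, 15)                    # stand-in for the lookup values

# Group the positions of v by value, once
idx <- split(seq_along(v), v)

# Look up each element of X by name; values absent from v come back as NULL
# (assumes integer values, so as.character(X) matches names(idx))
res <- idx[match(as.character(X), names(idx))]

# Replace the NULLs with integer(0) so the result mirrors which()
res[lengths(res) == 0L] <- list(integer(0))
names(res) <- X

# Reference result: the original loop, written with lapply
ref <- lapply(X, function(x) which(v == x))
identical(unname(res), ref)
```

Here res has one entry per element of X, in the order of X, so it can be passed on wherever match.ind was expected.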

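Cheers! :)

ORIGINAL POST BELOW... obviously superseded by the above!

Here is a toy example with tapply that might well run quicker... I made X and d relatively small so you can see what is going on: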

X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,replace=TRUE), b = sample(1:10,n,replace=TRUE), 
               c = sample(1:10,n,replace=TRUE), stringsAsFactors = FALSE)

tapply(X,X,function(x) {which(d[,3]==x)})
