performance - What is an efficient way to convert sets to a column index in R?

Question

Overview

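Given a large data frame A (5,000,000+ rows) with string row names, and a list B of disjoint sets (20,000+ sets) where each set consists of row names from A, what is the best way to create a vector that represents the sets in B by a unique value per set?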

Illustration

The following example illustrates the problem:

# Input
A <- data.frame(d = rep("A", 5e6), row.names = as.character(sample(1:5e6)))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"), ...) # Size 20,000+

The desired result is:

# An index of NA represents that the row is not part of any set in B.
> A[,"index", drop = F]
        d index
4655297 A     1
3328423 A     1
2911946 A     2
2829484 A     2
3871770 A    NA
2702914 A    NA
2581677 A    NA
4106410 A    NA
3755846 A    NA
3177816 A     1

Naive attempt

Something like this can be achieved with the following approach.

n <- 0
A$index <- NA
lapply(B, function(x){
  n <<- n + 1
  A[x, "index"] <<- n
})

Question

However, because A is indexed once per set, this is unreasonably slow (several hours), and it is neither very R-like nor elegant.

How can the desired result be produced quickly and efficiently?

score 4 · Accepted Answer

Here is a suggestion using base R, which is not too bad compared with your current approach.

Sample data:

A <- data.frame(d   = rep("A", 5e6),
                set = sample(c(NA, 1:20000), 5e6, replace = TRUE),
                row.names = as.character(sample(1:5e6)))
B <- split(rownames(A), A$set)

Base approach:

system.time({
A$index <- NA
A[unlist(B), "index"] <- rep(seq_along(B), times = lengths(B))
})
#    user  system elapsed 
#   15.30    0.19   15.50
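
To make the single assignment above easier to follow, here is a toy version of the same idea. The names `A_small`, `B_small`, `ids` and `set_id` and the row names `r1` to `r6` are made up for illustration and are not part of the question's data.

```r
# Toy data: six rows with character row names (hypothetical values).
A_small <- data.frame(d = rep("A", 6),
                      row.names = c("r1", "r2", "r3", "r4", "r5", "r6"))
# Two disjoint sets of row names; "r3" and "r6" belong to no set.
B_small <- list(c("r5", "r1", "r2"), c("r4"))

# Flatten the sets into one vector of row names ...
ids <- unlist(B_small)                                        # "r5" "r1" "r2" "r4"
# ... and repeat each set number once per member of that set.
set_id <- rep(seq_along(B_small), times = lengths(B_small))   # 1 1 1 2

A_small$index <- NA_integer_
# One vectorised row-name lookup instead of one lookup per set.
A_small[ids, "index"] <- set_id

A_small$index
# rows r1..r6: 1 1 NA 2 1 NA
```

The point is that the row names are matched against A once, for all sets together, rather than once per element of B.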

Check:

identical(A$set, A$index)
# TRUE

For anything faster, I suppose data.table would come in handy.
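
As a rough idea of what that might look like, here is a minimal sketch using a data.table update join. It is not timed and not taken from the original answer; the toy objects `A_toy`, `B_toy`, `lookup` and `DT`, and the column name `id`, are assumptions made for illustration.

```r
library(data.table)

# Toy data standing in for the real A and B (hypothetical values).
A_toy <- data.frame(d = rep("A", 6),
                    row.names = c("r1", "r2", "r3", "r4", "r5", "r6"))
B_toy <- list(c("r5", "r1", "r2"), c("r4"))

# Long lookup table: one row per (row name, set number) pair.
lookup <- data.table(id    = unlist(B_toy),
                     index = rep(seq_along(B_toy), times = lengths(B_toy)))

# data.table has no row names, so keep them as an ordinary column "id".
DT <- as.data.table(A_toy, keep.rownames = "id")

# Update join: add `index` to DT by reference; unmatched rows stay NA.
DT[lookup, on = "id", index := i.index]

DT$index
# rows r1..r6: 1 1 NA 2 1 NA
```

Whether this beats the base version on 5e6 rows would have to be measured with `system.time()` on the real data.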
