I have two data frames. The first is a lookup table containing a list of vertex names:
lookup <- data.frame(Name=c("Bob","Jane"))
The second is an edge list that looks like this:
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill")
)
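For each unique vertex in the "lookup" table, I want to go through the "edges" table and label every edgeID where lookup$Name is among the vertices. Concretely, for each name I keep every edge whose two endpoints both belong to the set made up of that name and its direct neighbors.
I can do this with the following script: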
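library(igraph)
# Undirected graph; columns reordered so the two vertex columns come first
g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      as.character(lookup$Name),
      function(nm) {
        # nm plus every vertex at distance 1 from it (its direct neighbors)
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        # Distinct edges with both endpoints in z, tagged with the lookup name
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)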
   group vertex1 edgeID vertex2
 1   Bob     Bob      1    Bill
 2   Bob    Bill      1     Bob
 3   Bob     Bob      1    Jane
 4   Bob    Jane      1     Bob
 5   Bob    Bill      1    Jane
 6   Bob     Bob      2    Jane
 7   Bob    Jane      2     Bob
 8   Bob    Jane      1    Bill
 9  Jane     Bob      1    Bill
10  Jane    Bill      1     Bob
11  Jane     Bob      1    Jane
12  Jane    Jane      1     Bob
13  Jane    Bill      1    Jane
14  Jane    Jane      1    Jill
15  Jane     Bob      2    Jane
16  Jane    Jane      2     Bob
17  Jane    Jane      1    Bill
18  Jane    Jane      3    Jill
19  Jane    Jill      3    Jane
20  Jane    Jane      3   Susan
21  Jane   Susan      3    Jane
22  Jane   Susan      3    Jill
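To make the per-vertex step concrete, here is a minimal base-R sketch of what happens for a single name ("Bob"). It is not the igraph script itself, just an equivalent computation on the toy data; the variable names nbrs and bob_edges are made up for illustration, and stringsAsFactors = FALSE is an assumption so the vertex columns are plain character vectors.

```r
# Toy edge list from the question (same rows, same order).
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"),
  stringsAsFactors = FALSE  # assumption: keep names as character, not factor
)

nm <- "Bob"

# Direct neighbors of nm: the other endpoint of every edge that touches nm.
nbrs <- unique(c(edges$vertex2[edges$vertex1 == nm],
                 edges$vertex1[edges$vertex2 == nm]))

# The vertex itself plus its neighbors (same role as z in the script).
z <- c(nm, nbrs)

# Distinct edges whose two endpoints both fall inside z.
bob_edges <- unique(edges[edges$vertex1 %in% z & edges$vertex2 %in% z, ])

print(z)                # "Bob" "Bill" "Jane"
print(nrow(bob_edges))  # 8, matching the eight "Bob" rows in the output
```

The full script repeats this for every name in lookup$Name, so each name triggers a pass over the whole "edges" table.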
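The problem is that this appears to be inefficient for large edge lists. In my real data, "lookup" has 3,263 observations and "edges" has 167,775,170 observations. I have been running the script above on an Amazon EC2 instance with 16 cores and 100 GB of RAM for two days now with no end in sight (using "future_lapply" instead of "lapply" to allow parallel processing). Is there any way to make this more efficient or faster?
This isn't the only time I will need to group edges like this, and I'm hoping to find an approach that is less expensive in terms of both time and Amazon bills.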
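For reference, here is a minimal sketch of the parallel variant mentioned above, shown on the toy data. It assumes the future.apply package with a multisession plan; the plan type and the worker count are illustrative assumptions, not necessarily the exact setup used on the EC2 instance, and res and out are names introduced only for this sketch.

```r
library(igraph)
library(future)        # plan()
library(future.apply)  # future_lapply()

# Toy data from the question.
lookup <- data.frame(Name = c("Bob", "Jane"), stringsAsFactors = FALSE)
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"),
  stringsAsFactors = FALSE
)

# Assumed setup: background R sessions as workers (2 here; 16 on a 16-core box).
plan(multisession, workers = 2)

g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)

# Same per-name function as the serial script, with lapply swapped for future_lapply.
# Note: g and edges are globals, so they are shipped to every worker session.
res <- future_lapply(
  as.character(lookup$Name),
  function(nm) {
    z <- c(nm, V(g)$name[distances(g, nm) == 1])
    cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
  }
)

out <- do.call(rbind, c(make.row.names = FALSE, res))

plan(sequential)  # shut the workers down again
```

With this kind of plan each worker holds its own copy of "edges", which may matter for memory at 167 million rows.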