I have two data frames. First, a lookup table containing a list of vertex names:

lookup <- data.frame(Name=c("Bob","Jane"))

Then I have an edge list that looks like this:

edges <- data.frame(vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob"
                              ,"Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
                  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
                  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob"
                              ,"Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"))

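For each unique vertex in the "lookup" table, I'd like to go through the "edges" table and label every edgeID where lookup$Name is among the vertices.

I can do this with the following script: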

library(igraph)

g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      as.character(lookup$Name),
      function(nm) {
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)
   group vertex1 edgeID vertex2
1    Bob     Bob      1    Bill
2    Bob    Bill      1     Bob
3    Bob     Bob      1    Jane
4    Bob    Jane      1     Bob
5    Bob    Bill      1    Jane
6    Bob     Bob      2    Jane
7    Bob    Jane      2     Bob
8    Bob    Jane      1    Bill
9   Jane     Bob      1    Bill
10  Jane    Bill      1     Bob
11  Jane     Bob      1    Jane
12  Jane    Jane      1     Bob
13  Jane    Bill      1    Jane
14  Jane    Jane      1    Jill
15  Jane     Bob      2    Jane
16  Jane    Jane      2     Bob
17  Jane    Jane      1    Bill
18  Jane    Jane      3    Jill
19  Jane    Jill      3    Jane
20  Jane    Jane      3   Susan
21  Jane   Susan      3    Jane
22  Jane   Susan      3    Jill
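
To make the selection rule concrete for a single name, here is a minimal base-R sketch without igraph. It assumes the rule the script above implements: take the name plus every vertex that shares an edge with it, then keep the edges whose two endpoints both fall in that set. The names `nm`, `nbrs`, `z` and `bob_edges` are only illustrative.

```r
# Toy edge list copied from above (strings kept as character, not factor).
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"),
  stringsAsFactors = FALSE
)

nm <- "Bob"

# Neighbours of nm: the other endpoint of every edge that touches nm.
nbrs <- unique(c(edges$vertex2[edges$vertex1 == nm],
                 edges$vertex1[edges$vertex2 == nm]))

# The name itself plus its direct neighbours.
z <- c(nm, nbrs)

# Keep edges whose endpoints both lie in z, dropping repeated rows.
bob_edges <- unique(subset(edges, vertex1 %in% z & vertex2 %in% z))
print(bob_edges)
```

For "Bob" this should select the same edges as the `group == "Bob"` rows in the output above.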

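The problem is that this seems inefficient for large edge lists. In my real data, "lookup" has 3,263 observations and "edges" has 167,775,170 observations. I have been running the script above on an Amazon EC2 instance with 16 cores and 100GB of RAM for two days now, with no end in sight (using "future_lapply" rather than "lapply" to allow parallel processing). Is there any way to make this more efficient/faster?

This is not the only time I will need to group edges like this, and I am hoping to find a way that is less expensive in terms of both time and my Amazon bill.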

1 Answer

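I think you can shrink the original `edges` data.frame first, which avoids calling `unique` inside every `lapply` iteration.

The code below may speed things up a little, but I am not sure how much it will gain on your real data.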

edges.unique <- unique(edges[c(1, 3, 2)])
g <- graph_from_data_frame(edges.unique, directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
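
As a possible further variation (not benchmarked, so treat it as a sketch), the vertex set `z` could also be built with igraph's `neighbors()`, which only looks at the edges touching `nm`, instead of computing a full row of `distances()`. The snippet below only checks that both approaches give the same set on the toy data; `z_dist` and `z_nbr` are illustrative names.

```r
library(igraph)

# Toy edge list from the question (strings kept as character, not factor).
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"),
  stringsAsFactors = FALSE
)

edges.unique <- unique(edges[c(1, 3, 2)])
g <- graph_from_data_frame(edges.unique, directed = FALSE)

nm <- "Jane"

# Vertex set via distances(): nm itself (distance 0) and its direct neighbours (distance 1).
d <- distances(g, nm)
z_dist <- colnames(d)[which(d < 2)]

# Same set via neighbors(); unique() drops repeats caused by parallel edges.
z_nbr <- unique(c(nm, neighbors(g, nm)$name))

print(setequal(z_dist, z_nbr))
```

Whether this helps at the scale of the real data is an open question; the `subset()` over all edges in each iteration is likely to remain the dominant cost.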

Update

edges.unique <- unique(
  transform(
    edges[c("vertex1", "vertex2", "edgeID")],
    vertex1 = ifelse(vertex1 < vertex2, vertex1, vertex2),
    vertex2 = ifelse(vertex1 < vertex2, vertex2, vertex1)
  )
)
g <- graph_from_data_frame(edges.unique, directed = FALSE)
res <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)

> res
   group vertex1 vertex2 edgeID
1    Bob    Bill     Bob      1
2    Bob     Bob    Jane      1
3    Bob    Bill    Jane      1
4    Bob     Bob    Jane      2
5   Jane    Bill     Bob      1
6   Jane     Bob    Jane      1
7   Jane    Bill    Jane      1
8   Jane    Jane    Jill      1
9   Jane     Bob    Jane      2
10  Jane    Jane    Jill      3
11  Jane    Jane   Susan      3
12  Jane    Jill   Susan      3
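
The `ifelse()` calls in the update put the two endpoints of every edge into a fixed order, so that `Bob`/`Jane` and `Jane`/`Bob` collapse into a single row under `unique()`. A sketch of the same idea using base R's `pmin()` and `pmax()`, which also accept character vectors, is shown below. It assumes the vertex columns are character rather than factor; `via_ifelse` and `via_pminmax` are illustrative names.

```r
# Toy edge list from the question (strings kept as character, not factor).
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"),
  stringsAsFactors = FALSE
)

# Ordering from the update: smaller name first, via ifelse().
via_ifelse <- unique(
  transform(
    edges[c("vertex1", "vertex2", "edgeID")],
    vertex1 = ifelse(vertex1 < vertex2, vertex1, vertex2),
    vertex2 = ifelse(vertex1 < vertex2, vertex2, vertex1)
  )
)

# Same ordering via element-wise minimum and maximum of the two name columns.
via_pminmax <- unique(
  data.frame(
    vertex1 = pmin(edges$vertex1, edges$vertex2),
    vertex2 = pmax(edges$vertex1, edges$vertex2),
    edgeID = edges$edgeID,
    stringsAsFactors = FALSE
  )
)

print(nrow(edges))        # rows before deduplication
print(nrow(via_pminmax))  # rows after ordering endpoints and deduplicating
```

Both variants rely on string comparison, so the chosen order depends on the locale, but any consistent order is enough for deduplication.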

When you run plot(g), you will see the simplified graph.

answered 2021-03-01T16:06:55.017