r - Switching from indices to names (or other attributes) in a list in R, for large data sets (igraph)

Question

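I am working with graph objects in R (the igraph package). I applied the function get.shortest.paths(), which gives the shortest paths from a given vertex to all other vertices in the graph. It returns a list in which each element corresponds to one target vertex and contains the vertex indices of all vertices on the shortest path between the source and that target. For example: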

head(get.shortest.paths(graph, from = V(graph)[1], to = V(graph), mode = "out"))
[[1]]
[1] 0 (source and target are the same)
[[2]]
[1]     0 91835 38405 89704     1
[[3]]
[1]     0 91835 12104 39002 22670     2
[[4]]
[1]     0 62386 36754 89246 31045     3

The problem comes when I want to go from vertex indices to vertex names. Something like this:

[[1]]
[1] "gene 1"
[[2]]
[1]     "gene 1"  "protein 45" "protein 83" "protein 70"     "gene 2"
[[3]]
[1]     "gene 1" "protein 45" "protein 30"  "reaction 2" "protein 404"     "gene 3"
[[4]]
[1]     "gene 1" "protein 4" "reaction 12" "protein 19"  "protein 494"   "gene 4"

I tried to do this with lapply():

path.index.list <- get.shortest.paths(graph, from = V(graph)[1], to = V(graph), mode = "out")
path.name.list <- lapply(path.index.list, FUN = function(path) V(graph)[path]$name)

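...but this takes far too long. A for loop takes just as long. In fact, the exact time needed to convert indices to names for the paths from one source vertex to all 100,000+ other vertices is...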

system.time(lapply(path.index.list, FUN = function(path) V(graph)[path]$name))
  user  system elapsed
608.62  152.69  761.66

...which works out to roughly 900 days for the whole graph.

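Is this one of those "pass by reference" versus "pass by value" problems? If so, can someone help me understand how to get around it? I have heard about using hashes or environments in R to solve problems like this; can anyone comment on that? I have also heard that some R packages can help here.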

Basically, how can I solve this without having to write the code in C?

score 0 · Accepted Answer

Query the vertex names up front, and index into them inside lapply:

names <- V(graph)$name
lapply(path.index.list, FUN = function(path) names[path])

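I guess this will be much faster, because lapply does not have to construct V(graph) and the full list of names on every call just to select a subset of it.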
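To make the idea concrete, here is a minimal sketch in base R with made-up stand-in data (vertex_names and the tiny path.index.list below are hypothetical, not taken from the asker's graph). One thing to watch: the output in the question shows a vertex id of 0, which suggests 0-based ids; as far as I know igraph 0.6 and later use 1-based ids in R. The sketch assumes 0-based ids and shifts by one before indexing.

```r
# Hypothetical stand-in data; in the question these would come from igraph.
vertex_names <- c("gene 1", "gene 2", "protein 45", "protein 83")  # plays the role of V(graph)$name
path.index.list <- list(0, c(0, 2, 3, 1))                          # 0-based ids, as in the question's output

# Assumption: ids are 0-based, so add 1 before indexing a plain R vector.
# With 1-based ids, drop the "+ 1".
path.name.list <- lapply(path.index.list,
                         function(path) vertex_names[path + 1])

print(path.name.list)
```

The point is that vertex_names is an ordinary character vector built once, so each call inside lapply is a plain vector subset rather than a fresh V(graph) lookup.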

score 0 · Accepted Answer

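Yes, I initially used the lapply approach that "Tamás" describes. Each iteration takes me about 230 seconds (roughly 2 seconds per 1,000 items). I tried the "fastmatch" package combined with preallocating memory in a matrix, and the speed actually dropped. I take that to mean this is more a matter of how fast R can look items up than of memory. I need to get it below 6 seconds per iteration for it to be really practical. I think I'm going to C...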
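One more thing that might be worth trying before dropping to C, offered as an untested-at-scale sketch rather than a guaranteed fix: flatten all paths into one vector, do a single vectorised lookup, and split the result back into paths. The data below is hypothetical, and the "+ 1" again assumes 0-based vertex ids. I have not benchmarked this against the lapply version on a graph of the size in the question.

```r
# Hypothetical stand-in data (not from the asker's graph).
vertex_names <- c("gene 1", "gene 2", "protein 45", "protein 83")
path.index.list <- list(0, c(0, 2, 3, 1), c(0, 2))

# One lookup for every id on every path.
flat_ids   <- unlist(path.index.list)
flat_names <- vertex_names[flat_ids + 1]   # assumption: 0-based ids

# Remember which path each element came from; fixing the factor levels
# keeps empty paths as empty list entries instead of dropping them.
group <- factor(rep(seq_along(path.index.list), lengths(path.index.list)),
                levels = seq_along(path.index.list))

# Cut the flat vector back into one character vector per path.
path.name.list <- unname(split(flat_names, group))

print(path.name.list)
```

This replaces 100,000+ small subsetting calls with one large one, though split() still has its own cost, so it is only worth keeping if it measures faster on the real data.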
