r - How to improve this hash function

Question

Is there a way to speed up the initialization of this hash? Currently it takes about 20 minutes on my machine.

#prepare hash()
hash <- list();

# mappedV: a matrix with more than 200,000 elements (definition omitted)
for( i in 1:nrow(mappedV) ) {
  hash[[paste(mappedV[i,], collapse = '.')]] <- 0;
}

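Before this piece of code I used a matrix, but that took more than 3 hours, so I won't complain about the 20 minutes. I'm just curious whether there are better alternatives. I use the hash to count each of the 200,000 possible combinations.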

PS: Concurrency might be an option, but that doesn't improve the hashing itself.

score 5 · Accepted Answer

You can often save significant time by preallocating a list of the desired length, rather than growing it at each iteration.

Behold:

X <- vector(mode="list", 1e5)
Y <- list()

system.time(for(i in 1:1e5) X[[i]] <- 0)
#    user  system elapsed 
#     0.3     0.0     0.3 
system.time(for(i in 1:1e5) Y[[i]] <- 0)
#    user  system elapsed 
#   48.84    0.05   49.34 
identical(X,Y)
# [1] TRUE

Because the entire list Y is copied each time an element is added, appending gets slower and slower as the list grows.
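
Applied to the loop in the question, the same idea could look roughly like the sketch below. The small `mappedV` matrix is an assumed stand-in for the real data. Values are stored by position in a preallocated list, and the names are attached once at the end.

```r
# Assumed stand-in for the asker's matrix (the real one is much larger).
mappedV <- matrix(1:20, ncol = 5)
n <- nrow(mappedV)

# Preallocate both containers at their final length.
hash <- vector(mode = "list", length = n)
keys <- character(n)

for (i in seq_len(n)) {
  keys[i] <- paste(mappedV[i, ], collapse = ".")
  hash[[i]] <- 0   # assignment by position, so the list never grows
}

# Attach the keys once, after the loop.
names(hash) <- keys
hash[["1.5.9.13.17"]]
```

Lookup by name on a plain list is still a linear scan, so this mainly addresses the construction cost, not the lookup cost.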

score 4 · Accepted Answer

You can also use an environment as a hash... let's see:

mappedV <- matrix(1:100000, ncol=5)
hash1 <- list()
hash2 <- new.env(hash=TRUE)

system.time(for(i in 1:nrow(mappedV)) hash1[[paste(mappedV[i,], collapse = '.')]] <- 0)
#   user  system elapsed 
# 19.263   1.321  21.634 

system.time(for(i in 1:nrow(mappedV)) hash2[[paste(mappedV[i,], collapse = '.')]] <- 0)
#   user  system elapsed 
#  0.426   0.002   0.430
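
Since the question mentions counting combinations, here is a minimal sketch of how an environment might be used as a counter. The helper `increment` and the sample matrix `observed` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: increment the counter stored under `key` in `env`.
increment <- function(env, key) {
  current <- if (exists(key, envir = env, inherits = FALSE)) env[[key]] else 0
  env[[key]] <- current + 1
  invisible(env)
}

counts <- new.env(hash = TRUE)

# Assumed sample data: rows 1 and 3 are the same combination.
observed <- matrix(c(1, 2, 1,
                     3, 4, 3), ncol = 2)

for (i in seq_len(nrow(observed))) {
  increment(counts, paste(observed[i, ], collapse = "."))
}

counts[["1.3"]]   # combination seen twice
counts[["2.4"]]   # combination seen once
```

`inherits = FALSE` keeps `exists` from finding a same-named object in an enclosing environment.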

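Update to answer the "gotchas" question

As Josh O'Brien points out, this is so fast because the entire environment is not copied when it is modified. Seems useful, right?

The "gotcha" can arise when you expect these objects to behave, with respect to immutability, like most other objects you are used to. When an environment is modified somewhere, it is changed everywhere. For example, if we pass an environment to a function that wipes all of its elements, the environment is wiped everywhere, whereas a list would not be.

Witness: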

hash1 <- list(a=1:10, b=rnorm(10))
hash2 <- new.env(hash=TRUE)
hash2$a <- 1:10
hash2$b <- rnorm(10)

danger <- function(x, axe) {
  for (wut in axe) x[[wut]] <- NULL
}

## the list is safe
danger(hash1, names(hash1))
hash1
# $a
#  [1]  1  2  3  4  5  6  7  8  9 10
#
# $b
# [1] -0.8575287  0.5248522  0.6957204 -0.7116208
# [5]  0.5536749  0.9860218 -1.2598799 -1.1054205
# [9]  0.3472648

## The environment gets mutilated
danger(hash2, names(hash1))
as.list(hash2)
# $a
# NULL
# 
# $b
# NULL
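
One possible way to guard against this, sketched under the assumption that a shallow copy is enough, is to hand the function a copy of the environment. `copy_env` is a hypothetical helper, not part of base R.

```r
hash2 <- new.env(hash = TRUE)
hash2$a <- 1:10

# Same function as in the answer above.
danger <- function(x, axe) {
  for (wut in axe) x[[wut]] <- NULL
}

# Hypothetical helper: shallow copy of an environment's bindings.
copy_env <- function(env) {
  list2env(as.list(env, all.names = TRUE), envir = new.env(hash = TRUE))
}

danger(copy_env(hash2), "a")
hash2$a   # still 1:10, because only the copy was modified
```

Copying gives up part of the speed advantage, so it only makes sense where a function must not touch the caller's data.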

score 2 · Accepted Answer

It's not as fast as using an environment, but there is a straightforward vectorized solution to this problem:

mappedV <- matrix(1:100000, ncol = 5)
hashes <- apply(mappedV, 1, paste, collapse = ".")

hash <- list()
hash[hashes] <- 0

Or, of course, you can turn a vector of zeros into a list and name it:

hash <- as.list(rep(0, length = length(hashes)))
names(hash) <- hashes

This takes <0.001s on my machine.
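
If the end goal is the counting described in the question, the vectorized keys could also be tallied directly with `table()`. This swaps in `table()` for the hash, which is not what the answer itself proposes, and the sample matrix `observed` is assumed for illustration.

```r
# Assumed sample data: rows 1 and 3 are the same combination.
observed <- matrix(c(1, 2, 1,
                     3, 4, 3), ncol = 2)

# Build one key per row, as in the answer above.
keys <- apply(observed, 1, paste, collapse = ".")

counts <- table(keys)   # one counter per distinct key
counts[["1.3"]]         # combination seen twice
counts[["2.4"]]         # combination seen once
```

Only combinations that actually occur get an entry, so any combination absent from the data would need to be handled separately.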
