r - Efficient way to sample from different probability vectors

Question

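I'm looking for a more efficient way to sample from the list of integers 1:n many times over, where the probability vector (also of length n) is different each time. For 20 trials with n = 10, I know this can be done like so: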

probs <- matrix(runif(200), nrow = 20)
answers <- numeric(20)
for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,])

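But this calls sample 20 times just to get a single number each time, so it is probably not the fastest way. Speed would be helpful, because the code will be doing this many times.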

Thanks very much!

Luke

Edit: Many thanks to Roman, whose idea about benchmarking helped me find a good solution. I have now moved it to an answer.

score 2 · Accepted Answer

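Just for fun, I tried two other versions. On what scale are you doing this sampling? I think all of these are pretty fast and more or less equivalent (I haven't included the creation of probs in the timing of your solution). I would love to see others take a shot at this.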

library(rbenchmark)
benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))

    test replications elapsed relative user.self sys.self user.child sys.child
1   luke         1000    0.41    1.000      0.42        0         NA        NA
2  roman         1000    0.47    1.146      0.46        0         NA        NA
3 roman2         1000    0.47    1.146      0.44        0         NA        NA

score 1 · Accepted Answer

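Here is another approach I found. It is fast, but not as fast as simply calling sample many times in a for loop. I initially thought it was very good, but I was using benchmark() incorrectly.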

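luke2 <- function(probs) { # takes a matrix of probability vectors, each in its own row
  probs <- probs / rowSums(probs)        # normalise each row to sum to 1
  probs <- t(apply(probs, 1, cumsum))    # row-wise cumulative probabilities
  # one uniform draw per row; count the cumulative values that fall below it
  answer <- rowSums(probs - runif(nrow(probs)) < 0) + 1
  return(answer)
}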

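Here is how it works: imagine the probabilities as segments of various lengths laid end to end along a number line from 0 to 1. Large probabilities take up more of the line than small ones. You can then choose an outcome by picking a random point on the line; the large probabilities are more likely to be hit. The advantage of this approach is that you can roll all the random numbers you need in a single call to runif(), rather than calling sample over and over as luke, roman and roman2 do. However, it looks like the extra data processing slows it down, and that cost more than offsets the benefit.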
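The number-line idea can be sketched for a single probability vector. This is only an illustration: pick_index() is a hypothetical helper that is not part of the answer, and the values of u are fixed by hand so the result can be checked against the segment boundaries.

```r
# pick_index() is a hypothetical helper, introduced here for illustration only.
pick_index <- function(p, u) {
  cum <- cumsum(p / sum(p))  # right-hand end of each segment on [0, 1]
  sum(cum < u) + 1           # segments lying entirely to the left of u, plus one
}

p <- c(0.2, 0.5, 0.3)        # segments: (0, 0.2], (0.2, 0.7], (0.7, 1]
pick_index(p, 0.10)          # 1: the point falls in the first segment
pick_index(p, 0.60)          # 2: the point falls in the second segment
pick_index(p, 0.95)          # 3: the point falls in the third segment

# In practice u comes from runif(), one value per draw.
draws <- vapply(runif(1000), function(u) pick_index(p, u), numeric(1))
table(draws) / length(draws) # observed frequencies, should be roughly p
```

luke2 does the same thing for every row of the matrix at once: the cumulative sums are taken row by row, and a single runif(nrow(probs)) call supplies one point per row.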

library(rbenchmark)
probs <- matrix(runif(2000), ncol = 10)  # 200 rows of 10 weights each
answers <- numeric(200)

benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),  # first 20 rows only
          luke2 = luke2(probs),  # all 200 rows
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),  # all 200 rows
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))  # 20 draws, fresh weights each time

    test replications elapsed relative user.self sys.self user.child sys.child
1   luke         1000   0.171    1.000     0.166    0.005          0         0
2  luke2         1000   0.529    3.094     0.518    0.012          0         0
3  roman         1000   1.564    9.146     1.513    0.052          0         0
4 roman2         1000   0.225    1.316     0.213    0.012          0         0

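For some reason, apply() seems to do very badly as you add more rows. I don't understand why, because I thought it was a wrapper around for(), so roman should perform similarly to luke. Note, though, that in the benchmark above luke and roman2 only draw 20 samples, while luke2 and roman work through all 200 rows of probs, so the timings are not a like-for-like comparison.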
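A like-for-like comparison of the loop and the apply() version could look like the sketch below. It is an assumption-laden rewrite, not the benchmark that produced the table above: luke_all() and roman_all() are hypothetical wrappers that each draw one value for every row of probs, the seed and repetition count are arbitrary, and base-R system.time() stands in for rbenchmark so the snippet needs no extra package. Timings depend on the machine, so none are quoted.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
probs <- matrix(runif(2000), ncol = 10)    # 200 rows, as in the benchmark above

# Hypothetical wrapper: the for-loop approach applied to every row.
luke_all <- function(probs) {
  answers <- numeric(nrow(probs))
  for (i in seq_len(nrow(probs))) {
    answers[i] <- sample(ncol(probs), 1, prob = probs[i, ])
  }
  answers
}

# Hypothetical wrapper: the apply() approach applied to every row.
roman_all <- function(probs) {
  apply(probs, MARGIN = 1, FUN = function(x) sample(length(x), 1, prob = x))
}

reps <- 100                                # arbitrary number of repetitions
t_luke  <- system.time(for (r in seq_len(reps)) res_luke  <- luke_all(probs))
t_roman <- system.time(for (r in seq_len(reps)) res_roman <- roman_all(probs))

# Elapsed seconds for each variant over the same 200 rows.
print(c(luke_all = t_luke[["elapsed"]], roman_all = t_roman[["elapsed"]]))
```

With both variants covering the same number of rows, any remaining gap reflects the per-call overhead of apply() rather than a difference in workload.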
