I'd like to read more about an algorithm that's used in R for unequal probability sampling, but after a few hours of searching I haven't been able to turn anything up on it. I thought it might have been an Art of Computer Programming algorithm, but I haven't been able to substantiate that either. The particular function in R's random.c is called ProbSampleNoReplace().

Given a vector of probabilities prob[] of length n, a desired sample size nans, and a vector of selected items ans[]

For each element j in prob[] assign an index perm[j]
Sort prob[] in order of probability value, largest first, reordering perm[] in parallel

totalmass = 1
for (h = 0, n1 = n-1; h < nans; h++, n1--)
    rt = totalmass * rand(in 0:1)
    mass = 0

    **sum the probabilities, largest first, until the sum is at least rt**
    for (j = 0; j < n1; j++)
        mass += prob[j]
        if rt <= mass then break

    ans[h] = perm[j]
    **reduce size of totalmass to reflect removed item**
    totalmass -= prob[j]

    **shift the remaining items down to close the gap left by item j**
    for (k = j; k < n1; k++)
        prob[k] = prob[k+1]
        perm[k] = perm[k+1]
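
For anyone who wants to experiment with it, the steps above can be written as a self-contained C sketch. This is my own transcription of the pseudocode, not a copy of R's source: `unif01`, `sort_desc`, and `sample_no_replace` are hypothetical names, `rand()` stands in for whatever uniform generator R uses, and the insertion sort stands in for R's sorting routine. It assumes the probabilities sum to 1 and that the sample size does not exceed the population size.

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform draw in [0, 1); an assumed stand-in for R's own generator. */
static double unif01(void)
{
    return rand() / ((double)RAND_MAX + 1.0);
}

/* Sort p[] into descending order, reordering perm[] in parallel.
   Plain insertion sort; hypothetical helper for illustration. */
static void sort_desc(double *p, int *perm, int n)
{
    for (int i = 1; i < n; i++) {
        double pv = p[i];
        int iv = perm[i];
        int j = i - 1;
        while (j >= 0 && p[j] < pv) {
            p[j + 1] = p[j];
            perm[j + 1] = perm[j];
            j--;
        }
        p[j + 1] = pv;
        perm[j + 1] = iv;
    }
}

/* Draw nans of n items without replacement.
   p[] (length n, summing to 1) is overwritten, perm[] is scratch space
   of length n, and ans[] receives 1-based item labels. */
static void sample_no_replace(int n, double *p, int *perm, int nans, int *ans)
{
    assert(nans <= n);

    for (int i = 0; i < n; i++)
        perm[i] = i + 1;
    sort_desc(p, perm, n);

    double totalmass = 1.0;
    int n1 = n - 1;                 /* index of the last remaining item */
    for (int h = 0; h < nans; h++, n1--) {
        double rt = totalmass * unif01();
        double mass = 0.0;
        int j;
        /* Walk the cumulative sum; if the loop runs out, j == n1,
           so the last remaining item is selected. */
        for (j = 0; j < n1; j++) {
            mass += p[j];
            if (rt <= mass)
                break;
        }
        ans[h] = perm[j];
        totalmass -= p[j];          /* remaining mass after removing item j */
        /* Close the gap left by item j. */
        for (int k = j; k < n1; k++) {
            p[k] = p[k + 1];
            perm[k] = perm[k + 1];
        }
    }
}
```

Calling `sample_no_replace(4, p, perm, 2, ans)` with `p = {0.4, 0.3, 0.2, 0.1}` fills `ans` with two distinct labels from 1 to 4. At each step an item is picked with probability proportional to its weight among the items still remaining. Sorting largest-first presumably just lets the linear scan stop early on average.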

1 Answer

The sample function supports an unequal-probability argument. For those of us who don't read C, your code snippet doesn't make its intent clear.

> table( sample(1:4, 100, repl=TRUE, prob=4:1) )

 1  2  3  4 
46 23 24  7 

There is also another SO Q&A that may be useful (found through an SO search with these terms):

random.c ProbSampleNoReplace

Faster weighted sampling without replacement

Answered 2013-03-17T22:03:27.237