当总体规模远大于样本规模时,上述算法变得低效,因为它们的复杂度为O ( n ),n是总体规模。
当我还是学生的时候,我写了一些无需替换的均匀采样算法,它们的平均复杂度为O ( s log s ),其中s是样本大小。这是二叉树算法的代码,平均复杂度为O ( s log s ),在 R 中:
# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
s=as.integer(size)
if (s>n) {
stop("Sample size is greater than the number of items to choose from")
}
# upv=integer(s) #level up edge is pointing to
leftv=integer(s) #left edge is poiting to; must be filled with zeros
rightv=integer(s) #right edge is pointig to; must be filled with zeros
samp=integer(s) #the sample
ordn=integer(s) #relative ordinal number
ordn[1L]=1L #initial value for the root vertex
samp[1L]=sample(n,1L)
if (s > 1L) for (j in 2L:s) {
curn=sample(n-j+1L,1L) #current number sampled
curordn=0L #currend ordinal number
v=1L #current vertice
from=1L #how have come here: 0 - by left edge, 1 - by right edge
repeat {
curordn=curordn+ordn[v]
if (curn+curordn>samp[v]) { #going down by the right edge
if (from == 0L) {
ordn[v]=ordn[v]-1L
}
if (rightv[v]!=0L) {
v=rightv[v]
from=1L
} else { #creating a new vertex
samp[j]=curn+curordn
ordn[j]=1L
# upv[j]=v
rightv[v]=j
break
}
} else { #going down by the left edge
if (from==1L) {
ordn[v]=ordn[v]+1L
}
if (leftv[v]!=0L) {
v=leftv[v]
from=0L
} else { #creating a new vertex
samp[j]=curn+curordn-1L
ordn[j]=-1L
# upv[j]=v
leftv[v]=j
break
}
}
}
}
return(samp)
}
该算法的复杂性在:Rouzankin, PS;Voytishek, AV 关于随机选择算法的成本。蒙特卡罗方法应用程序。5 (1999), 没有。1,39-54。
http://dx.doi.org/10.1515/mcma.1999.5.1.39
如果您发现该算法有用,请提供参考。
另见:P. Gupta,GP Bhattacharjee。(1984) 一种有效的无替换随机抽样算法。国际计算机数学杂志 16:4,第 201-209 页。DOI: 10.1080/00207168408803438
Teuhola, J. 和 Nevalainen, O. 1982。两种有效的无替换随机抽样算法。/IJCM/, 11(2): 127–140。DOI: 10.1080/00207168208803304
在上一篇论文中,作者使用哈希表并声称他们的算法具有O ( s ) 复杂度。还有一种更快的哈希表算法,很快就会在 pqR(pretty quick R)中实现:
https ://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html