python - Weighted random sample in python

Question

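I am looking for a reasonable definition of a function weighted_sample that does not return just one random index for a given list of weights (which would be something like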

def weighted_choice(weights, random=random):
    """ Given a list of weights [w_0, w_1, ..., w_n-1],
        return an index i in range(n) with probability proportional to w_i. """
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        if w<0:
            raise ValueError("Negative weight encountered.")
        rnd -= w
        if rnd < 0:
            return i
    raise ValueError("Sum of weights is not positive")

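which gives a categorical distribution with constant weights) but a random sample of k of those indices, without replacement, just as random.sample behaves compared to random.choice.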

Just as weighted_choice can be written as

lambda weights: random.choice([val for val, cnt in enumerate(weights)
    for i in range(cnt)])

weighted_sample could be written as

lambda weights, k: random.sample([val for val, cnt in enumerate(weights)
    for i in range(cnt)], k)

But I would like a solution that does not require me to unravel the weights into a (possibly huge) list.

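Edit: If there are any nice algorithms that give me back a histogram/list of frequencies (in the same format as the argument weights) instead of a sequence of indexes, that would also be very useful.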

score 8 · Accepted Answer

From your code: ..

weight_sample_indexes = lambda weights, k: random.sample([val 
        for val, cnt in enumerate(weights) for i in range(cnt)], k)

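.. I assume that the weights are positive integers, and that by "without replacement" you mean without replacement from the unraveled sequence.

Here is a solution based on random.sample and an O(log n) __getitem__: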

import bisect
import random
from collections import Counter
from collections.abc import Sequence  # the ABCs were removed from `collections` in Python 3.10

def weighted_sample(population, weights, k):
    return random.sample(WeightedPopulation(population, weights), k)

class WeightedPopulation(Sequence):
    def __init__(self, population, weights):
        assert len(population) == len(weights) > 0
        self.population = population
        self.cumweights = []
        cumsum = 0 # compute cumulative weight
        for w in weights:
            cumsum += w   
            self.cumweights.append(cumsum)  
    def __len__(self):
        return self.cumweights[-1]
    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.population[bisect.bisect(self.cumweights, i)]
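
The index lookup in __getitem__ above can be checked in isolation. The following is a minimal sketch, with example weights chosen only for illustration, showing that a binary search over the cumulative weights returns the same item index as indexing into the fully expanded list would:

```python
import bisect
from itertools import accumulate

weights = [1, 10, 2]                       # hypothetical example weights
cumweights = list(accumulate(weights))     # running totals: [1, 11, 13]

# The list the question wants to avoid building: index i repeated weights[i] times.
expanded = [i for i, w in enumerate(weights) for _ in range(w)]

# Map every virtual position to an item index with a binary search instead.
looked_up = [bisect.bisect(cumweights, pos) for pos in range(cumweights[-1])]

print(looked_up == expanded)
```

Only the cumulative totals are stored, so memory grows with the number of distinct items rather than with the sum of the weights.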

Example

total = Counter()
for _ in range(1000):
    sample = weighted_sample("abc", [1,10,2], 5)
    total.update(sample)
print(sample)
print("Frequences %s" % (dict(Counter(sample)),))

# Check that values are sane
print("Total " + ', '.join("%s: %.0f" % (val, count * 1.0 / min(total.values()))
                           for val, count in total.most_common()))

Output

['b', 'b', 'b', 'c', 'c']
Frequences {'c': 2, 'b': 3}
Total b: 10, c: 2, a: 1

score 3 · Accepted Answer

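What you want to create is a non-uniform random distribution. One bad way of doing this is to create a giant array in which the output symbols appear in proportion to their weights. So if a is 5 times more likely than b, you create an array with 5 times as many a's as b's. This works for simple distributions where the weights are even multiples of each other. But what if you wanted 99.99% a and 0.01% b? You would have to create 10000 slots.

There is a better way. Every non-uniform distribution over N symbols can be decomposed into a series of N-1 binary distributions, each of which is equally likely to be chosen.

So if you have such a decomposition, you first select one of the binary distributions at random by generating a uniform random number from 1 to N-1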

u32 dist = randInRange( 1, N-1 ); // generate a uniform random number from 1 to N-1

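Then, say the selected distribution is a binary distribution over two symbols a and b, where a covers the range from 0 to alpha and b covers the range from alpha to 1: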

float f = randomFloat();
return ( f > alpha ) ? b : a;

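How to decompose an arbitrary non-uniform distribution is a bit more involved. Essentially you create N-1 "buckets". Pick the symbol with the lowest probability and the symbol with the highest probability, and distribute their weights proportionally into the first binary distribution. Then delete the smallest symbol, and subtract from the larger symbol the amount of weight that was used to fill this binary distribution. Repeat this process until you have no symbols left.

I can post C++ code for this if you want to go with this solution.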
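
A minimal Python sketch of the decomposition described above, assuming non-negative weights with a positive sum. The names build_binary_buckets and draw are made up for illustration and do not come from the answer. Note that it draws a single index per call (that is, with replacement), so it illustrates non-uniform sampling in general rather than the without-replacement case asked about in the question.

```python
import random

def build_binary_buckets(weights):
    """Split the weights into len(weights) - 1 equally likely two-symbol buckets.

    Each bucket is a tuple (alpha, a, b): pick index a with probability alpha,
    otherwise pick index b.
    """
    n = len(weights)
    total = float(sum(weights))
    if n < 2 or total <= 0:
        raise ValueError("need at least two symbols and a positive weight sum")
    capacity = 1.0 / (n - 1)  # probability mass held by every bucket
    remaining = {i: w / total for i, w in enumerate(weights)}
    buckets = []
    while len(remaining) > 1:
        # Least likely symbol goes into the bucket completely ...
        small = min(remaining, key=remaining.get)
        # ... and the most likely other symbol fills the rest of the bucket.
        large = max((i for i in remaining if i != small), key=remaining.get)
        p_small = remaining.pop(small)
        buckets.append((p_small / capacity, small, large))
        # Clamp at zero to absorb floating-point rounding.
        remaining[large] = max(0.0, remaining[large] - (capacity - p_small))
    return buckets

def draw(buckets, rng=random):
    """Pick a bucket uniformly, then one of its two symbols."""
    alpha, a, b = rng.choice(buckets)
    return a if rng.random() < alpha else b
```

Building the buckets this way costs O(N^2) because of the repeated min/max scans; each draw afterwards needs only two random numbers, independent of N.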

score 0 · Accepted Answer

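If you construct the right data structure for random.sample() to operate on, you don't need to define a new function at all. Just use random.sample().

Here, __getitem__() is O(n), where n is the number of different items with weights that you have. But it is compact in memory, since only the (weight, value) pairs need to be stored. I have used a similar class in practice, and it was plenty fast for my purposes. Note that this implementation assumes integer weights.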

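from collections.abc import Sequence

# random.sample() on Python 3.11+ only accepts a Sequence, hence the base class
class SparseDistribution(Sequence):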
    _cached_length = None

    def __init__(self, weighted_items):
        # weighted items are (weight, value) pairs
        self._weighted_items = []
        for item in weighted_items:
            self.append(item)

    def append(self, weighted_item):
        self._weighted_items.append(weighted_item)
        self.__dict__.pop("_cached_length", None)

    def __len__(self):
        if self._cached_length is None:
            length = 0
            for w, v in self._weighted_items:
                length += w
            self._cached_length = length
        return self._cached_length

    def __getitem__(self, index):
        if index < 0 or index >= len(self):
            raise IndexError(index)
        for w, v in self._weighted_items:
            if index < w:
                return v
            index -= w  # skip past this item's block of virtual slots
        raise Exception("Shouldn't have happened")

    def __iter__(self):
        for w, v in self._weighted_items:
            for _ in range(w):
                yield v

Then, we can use it:

import random

d = SparseDistribution([(5, "a"), (2, "b")])
d.append((3, "c"))

for num in (3, 5, 10, 11):
    try:
        print(random.sample(d, num))
    except Exception as e:
        print("{}({!r})".format(type(e).__name__, str(e)))

Resulting in:

['a', 'a', 'b']
['b', 'a', 'c', 'a', 'b']
['a', 'c', 'a', 'c', 'a', 'b', 'a', 'a', 'b', 'c']
ValueError('sample larger than population')
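
On Python 3.9 and later, random.sample() itself accepts a keyword-only counts argument that describes repeated elements without listing them one by one. The following is a minimal sketch assuming positive integer weights; the helper names sample_indices_by_count and sample_histogram_by_count are made up for illustration.

```python
import random
from collections import Counter

def sample_indices_by_count(weights, k, rng=random):
    """Draw k indices without replacement from an urn that holds
    weights[i] copies of index i (integer weights, Python 3.9+)."""
    return rng.sample(range(len(weights)), k, counts=weights)

def sample_histogram_by_count(weights, k, rng=random):
    """Same draw, returned as a frequency list in the format of `weights`."""
    drawn = Counter(sample_indices_by_count(weights, k, rng))
    return [drawn[i] for i in range(len(weights))]

print(sample_indices_by_count([5, 2, 3], 4))
print(sample_histogram_by_count([5, 2, 3], 4))
```

As with the class above, asking for more elements than the weights add up to raises a ValueError.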

score 0 · Accepted Answer

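Since I am currently mostly interested in a histogram of the results, I thought of the following solution using numpy.random.hypergeometric (which unfortunately behaves badly for the border cases ngood < 1, nbad < 1 and nsample < 1, so these cases need to be checked separately).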

import numpy

def weighted_sample_histogram(frequencies, k, random=numpy.random):
    """ Given a sequence of absolute frequencies [w_0, w_1, ..., w_n-1],
    return a generator [s_0, s_1, ..., s_n-1] where the number s_i gives the
    absolute frequency of drawing the index i from an urn in which that index is
    represented by w_i balls, when drawing k balls without replacement. """
    W = sum(frequencies)
    if k > W:
        raise ValueError("Sum of absolute frequencies less than number of samples")
    for frequency in frequencies:
        if k < 1 or frequency < 1:
            yield 0
        else:
            W -= frequency
            if W < 1:
                good = k
            else:
                good = random.hypergeometric(frequency, W, k)
            k -= good
            yield good
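    # The generator simply ends here; an explicit `raise StopIteration`
    # turns into a RuntimeError on Python 3.7+ (PEP 479).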

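I would be happy to hear any comments on how to improve this, or on why it might not be a good solution.

A python package implementing this (and other weighted random things) is now at http://github.com/Anaphory/weighted_choice.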

score 0 · Accepted Answer

Another solution

import random
from typing import List, Any
import numpy as np

def weighted_sample(choices: List[Any], probs: List[float]):
    """
    Sample one element from `choices` with probability according to `probs`
    """
    probs = np.concatenate(([0], np.cumsum(probs)))
    r = random.random()  # uniform in [0, 1)
    for j in range(len(choices)):
        # half-open interval, so r == 0.0 is handled and probs[j + 1] stays in bounds
        if probs[j] <= r < probs[j + 1]:
            return choices[j]

Example:

aa = [0,1,2,3]
probs = [0.1, 0.8, 0.0, 0.1]
np.average([weighted_sample(aa, probs) for _ in range(10000)])

Out: 1.0993

score -3 · Accepted Answer

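sample() is pretty fast. So unless you have many megabytes to deal with, sample() should be fine.

On my machine, it took 1.655 seconds to produce 1000 samples of length 100 out of 10000000 elements. Traversing 100000 samples of length 100 from 10000000 elements took 12.98 seconds.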

from random import sample,random
from time import time

def generate(n1,n2,n3):
    w = [random() for x in range(n1)]

    print(len(w))

    samples = list()
    for i in range(0,n2):
        s = sample(w,n3)
        samples.append(s)

    return samples

start = time()
size_set = 10**7
num_samples = 10**5
length_sample = 100
samples = generate(size_set,num_samples,length_sample)
end = time()

allsum = 0
for row in samples:
    allsum += sum(row)  # built-in sum; the original shadowed it with a local name

print('sum of all elements', allsum)

print('%f seconds for %i samples of %i length %i'
      % ((end - start), size_set, num_samples, length_sample))
