algorithm - 具有指定结果概率的值的随机样本

Question

想象一下，我们有四个符号 - 'a'、'b'、'c'、'd'。我们还有四个给定的符号出现在函数输出中的概率 - P1、P2、P3、P4（它们的总和等于1）。如何实现一个函数，该函数会生成其中三个符号的随机样本，例如结果符号以那些指定的概率出现在其中？

示例：'a'、'b'、'c'和'd'的概率分别为9/30、8/30、7/30和6/30。该函数输出这四个符号中任意三个的各种随机样本：'abc'、'dca'、'bad'等等。我们多次运行此函数，计算每个符号在其输出中出现的次数。最后，为'a'存储的计数值除以输出的符号总数应收敛到9/30，对于'b'，对于'c'到7/30，对于'd'到6/30。

例如，该函数生成 10 个输出：

adc
dab
bca
dab
dba
cab
dcb
acd
cab
abc

其中 30 个符号中包含 9 个'a'， 8 个'b'， 7 个'c'和 6 个'd'。当然，这是一个理想主义的例子，因为只有当样本数量大得多时，这些值才会收敛——但它应该有望传达这个想法。

显然，这一切只有在两个概率都不大于 1/3 时才有可能，因为每个单个样本输出总是包含三个不同的符号。如果无法满足所提供的值，则函数可以进入无限循环或以其他方式行为异常。

注意：该函数显然应该使用 RNG，否则应该是无状态的。每个新的调用都应该独立于之前的任何调用，除了 RNG 状态。

编辑：即使描述提到从 4 个值中选择 3 个，理想情况下该算法应该能够处理任何样本大小。

score 1 · Accepted Answer

您的问题未确定。

如果我们为每个允许的三个字母字符串分配一个概率，p(abc)、p(abd)、p(acd) 等 xtc，我们可以生成一系列方程

eqn1: p(abc) + p(abd) + ... others with a "a" ... = p1
  ...
  ...
eqn2: p(abd) + p(acd) + ... others with a "d" ... = p4

这比方程式有更多的未知数，因此有很多解决方法。找到解决方案后，无论您选择哪种方法（如果您是我，请使用单纯形算法），使用@alestanis 描述的轮盘赌方法从每个字符串的概率中采样。

from numpy import *

# using cvxopt-1.1.5
from cvxopt import matrix, solvers 

###########################
# Functions to do some parts

# function to find all valid outputs
def perms(alphabet, length):
    if length == 0:
        yield ""
        return
    for i in range(len(alphabet)):
        val1 = alphabet[i]
        for val2 in perms(alphabet[:i]+alphabet[i+1:], length-1):
            yield val1 + val2


# roulette sampler
def roulette_sampler(values, probs):
    # Create cumulative prob distro
    probs_cum = [sum(probs[:i+1]) for i in range(n_strings)]
    def fun():
        r = random.rand()
        for p,s in zip(probs_cum, values):
            if r < p:
                return s
        # in case of rounding error
        return values[-1]
    return fun


############################
#    Main Part



# create list of all valid strings

alphabet = "abcd"
string_length = 3
alpha_probs = [string_length*x/30. for x in range(9,5,-1)]

# show probs
for a,p in zip(alphabet, alpha_probs):
    print "p("+a+") =",p




# all valid outputs for this particular case
strings = [perm for perm in perms(alphabet, string_length)]
n_strings = len(strings)

# constraints from probabilities p(abc) + p(abd) ... = p(a)
contains = array([[1. if s.find(a) >= 0 else 0. for a in alphabet] for s in strings])
#both = concatenate((contains,wons), axis=1).T # hacky, but whatever
#A = matrix(both)
#b = matrix(alpha_probs + [1.])
A = matrix(contains.T)
b = matrix(alpha_probs)

#also need to constrain to [0,1]
wons = array([[1. for s in strings]])
G = matrix(concatenate((eye(n_strings),wons,-eye(n_strings),-wons)))
h = matrix(concatenate((ones(n_strings+1),zeros(n_strings+1))))

## target matricies for approx KL divergence
# uniform prob over valid outputs
u = 1./len(strings)
P = matrix(eye(n_strings))
q = -0.5*u*matrix(ones(n_strings))
# will minimise p^2 - pq for each p val equally


# Do convex optimisation
sol = solvers.qp(P,q,G,h,A,b)
probs = array(sol['x'])

# Print ouput
for s,p in zip(strings,probs):
    print "p("+s+") =",p
checkprobs = [0. for char in alphabet]
for a,i in zip(alphabet, range(len(alphabet))):
    for s,p in zip(strings,probs):
        if s.find(a) > -1:
            checkprobs[i] += p
    print "p("+a+") =",checkprobs[i]
print "total =",sum(probs)

# Create the sampling function
rndstring = roulette_sampler(strings, probs)


###################
# Verify

print "sampling..."
test_n = 1000
output = [rndstring() for i in xrange(test_n)]

# find which one it is
sampled_freqs = []
for char in alphabet:
    n = 0
    for val in output:
        if val.find(char) > -1:
            n += 1
    sampled_freqs += [n]

print "plotting histogram..."
import matplotlib.pyplot as plt
plt.bar(range(0,len(alphabet)),array(sampled_freqs)/float(test_n), width=0.5)
plt.show()

编辑：Python代码

score 1 · Accepted Answer

假设一个单词的长度总是比符号数小一，下面的 C# 代码可以完成这项工作：

using System;
using System.Collections.Generic;
using System.Linq;
using MathNet.Numerics.Distributions;

namespace RandomSymbols
{
    class Program
    {
        static void Main(string[] args)
        {
            // Sample case:  Four symbols with the following distribution, and 10000 trials
            double[] distribution = { 9.0/30, 8.0/30, 7.0/30, 6.0/30 };
            int trials = 10000;

            // Create an array containing all of the symbols
            char[] symbols = Enumerable.Range('a', distribution.Length).Select(s => (char)s).ToArray();

            // We're assuming that the word length is always one less than the number of symbols
            int wordLength = symbols.Length - 1;

            // Calculate the probability of each symbol being excluded from a given word
            double[] excludeDistribution = Array.ConvertAll(distribution, p => 1 - p * wordLength);

            // Create a random variable using the MathNet.Numerics library
            var randVar = new Categorical(excludeDistribution);
            var random = new Random();
            randVar.RandomSource = random;

            // We'll store all of the words in an array
            string[] words = new string[trials];

            for (int t = 0; t < trials; t++)
            {
                // Start with a word containing all of the symbols
                var word = new List<char>(symbols);

                // Remove one of the symbols
                word.RemoveAt(randVar.Sample());

                // Randomly permute the remainder
                for (int i = 0; i < wordLength; i++)
                {
                    int swapIndex = random.Next(wordLength);
                    char temp = word[swapIndex];
                    word[swapIndex] = word[i];
                    word[i] = temp;
                }

                // Store the word
                words[t] = new string(word.ToArray());
            }

            // Display words
            Array.ForEach(words, w => Console.WriteLine(w));

            // Display histogram
            Array.ForEach(symbols, s => Console.WriteLine("{0}: {1}", s, words.Count(w => w.Contains(s))));
        }

    }
}

更新：以下是 rici 概述的方法的 C 实现。棘手的部分是计算他提到的阈值，这是我用递归完成的。

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// ****** Change the following for different symbol distributions, word lengths, and number of trials ******
double targetFreqs[] = {10.0/43, 9.0/43, 8.0/43, 7.0/43, 6.0/43, 2.0/43, 1.0/43 };
const int WORDLENGTH = 4;
const int TRIALS = 1000000;
// *********************************************************************************************************

const int SYMBOLCOUNT = sizeof(targetFreqs) / sizeof(double);
double inclusionProbs[SYMBOLCOUNT];

double probLeftToIncludeTable[SYMBOLCOUNT][SYMBOLCOUNT];

// Calculates the probability that there will be n symbols left to be included when we get to the ith symbol.
double probLeftToInclude(int i, int n)
{
    if (probLeftToIncludeTable[i][n] == -1)
    {
        // If this is the first symbol, then the number of symbols left to be included is simply the word length.
        if (i == 0)
        {
            probLeftToIncludeTable[i][n] = (n == WORDLENGTH ? 1.0 : 0.0);
        }
        else
        {
            // Calculate the probability based on the previous symbol's probabilities.
            // To get to a point where there are n symbols left to be included, either there were n+1 symbols left
            // when we were considering that previous symbol and we included it, leaving n,
            // or there were n symbols left and we didn't included it, also leaving n.
            // We have to take into account that the previous symbol may have been manditorily included.
             probLeftToIncludeTable[i][n] = probLeftToInclude(i-1, n+1) * (n == SYMBOLCOUNT-i ? 1.0 : inclusionProbs[i-1])
                + probLeftToInclude(i-1, n) * (n == 0 ? 1.0 : 1 - inclusionProbs[i-1]);
        }
    }
    return probLeftToIncludeTable[i][n];
}

// Calculates the probability that the ith symbol won't *have* to be included or *have* to be excluded.
double probInclusionIsOptional(int i)
{
    // The probability that inclusion is optional is equal to 1.0
    // minus the probability that none of the remaining symbols can be included
    // minus the probability that all of the remaining symbols must be included.
    return 1.0 - probLeftToInclude(i, 0) - probLeftToInclude(i, SYMBOLCOUNT - i);
}

// Calculates the probability with which the ith symbol should be included, assuming that
// it doesn't *have* to be included or *have* to be excluded.
double inclusionProb(int i)
{
    // The following is derived by simple algebra:
    // Unconditional probability = (1.0 * probability that it must be included) + (inclusionProb * probability that inclusion is optional)
    // therefore...
    // inclusionProb = (Unconditional probability - probability that it must be included) / probability that inclusion is optional
    return (targetFreqs[i]*WORDLENGTH - probLeftToInclude(i, SYMBOLCOUNT - i)) / probInclusionIsOptional(i);
}

int main(int argc, char* argv[])
{
    srand(time(NULL));

    // Initialize inclusion probabilities
    for (int i=0; i<SYMBOLCOUNT; i++)
        for (int j=0; j<SYMBOLCOUNT; j++)
            probLeftToIncludeTable[i][j] = -1.0;

    // Calculate inclusion probabilities
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        inclusionProbs[i] = inclusionProb(i);
    }

    // Histogram
    int histogram[SYMBOLCOUNT];
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        histogram[i] = 0;
    }

    // Scratchpad for building our words
    char word[WORDLENGTH+1];
    word[WORDLENGTH] = '\0';

    // Run trials
    for (int t=0; t<TRIALS; t++)
    {
        int included = 0;

        // Build the word by including or excluding symbols according to the problem constraints
        // and the probabilities in inclusionProbs[].
        for (int i=0; i<SYMBOLCOUNT && included<WORDLENGTH; i++)
        {
            if (SYMBOLCOUNT - i == WORDLENGTH - included // if we have to include this symbol
                || (double)rand()/(double)RAND_MAX < inclusionProbs[i]) // or if we get a lucky roll of the dice
            {
                word[included++] = 'a' + i;
                histogram[i]++;
            }
        }

        // Randomly permute the word
        for (int i=0; i<WORDLENGTH; i++)
        {
            int swapee = rand() % WORDLENGTH;
            char temp = word[swapee];
            word[swapee] = word[i];
            word[i] = temp;
        }

        // Uncomment the following to show each word
        // printf("%s\r\n", word);
    }

    // Show the histogram
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        printf("%c: target=%d, actual=%d\r\n", 'a'+i, (int)(targetFreqs[i]*WORDLENGTH*TRIALS), histogram[i]);
    }

    return 0;
}

score 1 · Accepted Answer

我认为这是一个非常有趣的问题。我不知道一般的解决方案，但在样本大小为 n-1 的情况下（如果有解决方案）很容易解决，因为正好有 n 个可能的样本，每个样本都对应于一个不存在的样本的元素。

假设我们正在从大小为 4 的宇宙中寻找大小为 3 的样本中的 F a = 9/30、F b = 8/30、F c = 7/30、F d = 6/30 _，就像_在OP_中一样_。我们可以通过选择不包含给定对象的样本将这些频率中的每一个直接转换为样本频率。例如，我们希望选择的对象中有 9/30 是a's；a一个样本中不能有多个符号，并且样本中总是有三个符号；因此，9/10 的样本必须包含a，而 1/10 的样本不能包含a。但是只有一个可能的样本不包含a: bcd。所以 10% 的样本必须是bcd. 同样，20% 必须是acd; 30%abd和 40% abc。（或者，更一般地说，F _ā = 1 - (n-1)F _a其中 F _ā是不包括的（唯一）样本的频率a）

我不禁想到，这种观察与生成独特样本的经典方法之一相结合可以解决一般问题。但我没有那个解决方案。对于它的价值，我正在考虑的算法如下：

To select a random sample of size k out of a universe U of n objects:
1) Set needed = k; available = n.
2) For each element in U, select a random number in the range [0, 1).
3) If the random number is less than k/n:
     3a) Add the element to the sample.
     3b) Decrement needed by 1. If it reaches 0, we're finished.
4) Decrement available, and continue with the next element in U.

所以我的想法是应该可以通过在步骤 3 中更改阈值来操纵元素的频率，使其以某种方式成为相应元素所需频率的函数。

score 0 · Accepted Answer

为此，您必须使用一个临时数组来存储概率的累积总和。

在您的示例中，概率分别为 9/30、8/30、7/30 和 6/30。然后你应该有一个数组：

values = {'a', 'b', 'c', 'd'}
proba = {9/30, 17/30, 24/30, 1}

然后你选择一个随机数r并[0, 1]这样做：

char chooseRandom() {
    int i = 0;
    while (r > proba[i])
        ++i; 

    return values[i];
}

algorithm - 具有指定结果概率的值的随机样本

4 回答 4

Related

Reference