python - Binomial test in Python for very large numbers

Question

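I need to do a binomial test in Python that allows calculation for values of "n" on the order of 10000.

I have implemented a quick binomial_test function using scipy.misc.comb. However, it is pretty much limited to around n = 1000, I guess because it reaches the largest representable number while computing the factorials or the combination itself. Here is my function: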

from scipy.misc import comb
def binomial_test(n, k):
    """Calculate binomial probability
    """
    p = comb(n, k) * 0.5**k * 0.5**(n-k)
    return p

How could I use a native Python (or numpy, scipy...) function to calculate that binomial probability? If possible, I need scipy 0.7.2 compatible code.

Many thanks!

score 10 · Accepted Answer

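Edited to add this comment: please note that, as Daniel Stutzbach mentioned, the "binomial test" is probably not what the original poster was asking for (though he did use this expression). He seems to be asking for the probability density function of a binomial distribution, which is not what I'm suggesting below.

Have you tried scipy.stats.binom_test?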

rbp@apfelstrudel ~$ python
Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) 
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from scipy import stats
>>> print stats.binom_test.__doc__

    Perform a test that the probability of success is p.

    This is an exact, two-sided test of the null hypothesis
    that the probability of success in a Bernoulli experiment
    is `p`.

    Parameters
    ----------
    x : integer or array_like
        the number of successes, or if x has length 2, it is the
        number of successes and the number of failures.
    n : integer
        the number of trials.  This is ignored if x gives both the
        number of successes and failures
    p : float, optional
        The hypothesized probability of success.  0 <= p <= 1. The
        default value is p = 0.5

    Returns
    -------
    p-value : float
        The p-value of the hypothesis test

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Binomial_test


>>> stats.binom_test(500, 10000)
4.9406564584124654e-324

Small edit to add a documentation link: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html#scipy.stats.binom_test

BTW: this works on scipy 0.7.2, as well as on the current 0.8 dev.

score 6 · Accepted Answer

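Any solution that looks like comb(n, k) * 0.5**k * 0.5**(n-k) isn't going to work for large n. On most (all?) platforms, the smallest value a Python float can store is around 2**-1022. For large n-k or large k, the right-hand side will get rounded to 0. Likewise, comb(n, k) can grow so large that it will not fit in a float.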
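Both failure modes can be reproduced with the standard library alone. The sketch below assumes Python 3.8 or later, where math.comb is available as an exact-integer stand-in for SciPy's comb; the values n = 10000 and k = 5000 are only illustrative.

```python
import math
import sys

n, k = 10000, 5000

# Smallest positive normalized double (2**-1022 on IEEE 754 platforms).
print(sys.float_info.min)

# The power term is far below that limit and is rounded to exactly 0.0.
power_term = 0.5 ** k
print(power_term)

# The exact coefficient is an integer with thousands of digits,
# so converting it to a float fails.
coefficient = math.comb(n, k)
try:
    float(coefficient)
    fits_in_float = True
except OverflowError:
    fits_in_float = False
print(fits_in_float)
```

Either problem on its own is enough to break the naive product, which is why the approaches below avoid forming these two factors as floats.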

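A more robust approach is to compute the probability density function as the difference between two consecutive points of the cumulative distribution function, which can be computed using the regularized incomplete beta function (look in SciPy's "special functions" package). Mathematically: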

pdf(p, n, k) = cdf(p, n, k) - cdf(p, n, k-1)
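For reference, the standard identity linking the binomial CDF to the regularized incomplete beta function $I_x(a, b)$ is written out below, with the same argument order as the formula above. In SciPy this function is exposed as scipy.special.betainc(a, b, x). The boundary cases are listed separately because both beta parameters must be positive.

```latex
\begin{align}
\mathrm{cdf}(p, n, k) &= \sum_{i=0}^{k} \binom{n}{i} p^{i} (1-p)^{n-i}
                       = I_{1-p}(n-k,\; k+1), && 0 \le k < n \\
\mathrm{cdf}(p, n, k) &= 0 \ \text{for } k < 0, \qquad
\mathrm{cdf}(p, n, k) = 1 \ \text{for } k \ge n \\
\mathrm{pdf}(p, n, k) &= I_{1-p}(n-k,\; k+1) - I_{1-p}(n-k+1,\; k), && 1 \le k < n
\end{align}
```

Because the result is a difference of two numbers close to each other, some relative precision is lost far out in the tails, where both CDF values are nearly 0 or nearly 1.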

Another option is to use the Normal approximation, which is quite accurate for large n. If speed is a concern, this is probably the way to go:

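from math import exp, pi, sqrt

from scipy.misc import comb  # lives in scipy.special in newer SciPy releases

def normal_pdf(x, m, v):
    # Density at x of a normal distribution with mean m and variance v
    return 1.0/sqrt(2*pi*v) * exp(-(x-m)**2/(2*v))

def binomial_pdf(p, n, k):
    if n < 100:
        # Fall back to your current method
        return comb(n, k) * p**k * (1.0-p)**(n-k)
    return normal_pdf(k, n*p, n*p*(1.0-p))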

I haven't tested the code, but that should give you the general idea.
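One way to judge whether the Normal approximation is good enough is to compare it with an exact value. The sketch below does that for p = 0.5 only, using exact rational arithmetic from the standard library as the reference. It assumes Python 3.8 or later for math.comb, and the helper names binomial_pmf_normal, binomial_pmf_exact_half, and relative_error are made up for this illustration.

```python
import math
from fractions import Fraction

def normal_pdf(x, mean, variance):
    """Density at x of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def binomial_pmf_normal(p, n, k):
    """Normal approximation to the probability of k successes in n trials."""
    return normal_pdf(k, n * p, n * p * (1.0 - p))

def binomial_pmf_exact_half(n, k):
    """Exact reference for p = 0.5 only: comb(n, k) / 2**n as a rational number."""
    return float(Fraction(math.comb(n, k), 2 ** n))

def relative_error(n, k):
    """Relative error of the Normal approximation against the exact value, p = 0.5."""
    exact = binomial_pmf_exact_half(n, k)
    return abs(binomial_pmf_normal(0.5, n, k) - exact) / exact

n = 10000
for k in (5000, 5100, 5200):
    print(k, binomial_pmf_normal(0.5, n, k), binomial_pmf_exact_half(n, k), relative_error(n, k))
```

The relative error is smallest near the mean and grows toward the tails, so the approximation is a reasonable choice mainly when k stays within a few standard deviations of n*p.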

score 3 · Accepted Answer

GMPY also supports extended precision floating point calculations. For example:

>>> from gmpy import *
>>>
>>> def f(n,k,p,prec=256):
...     return mpf(comb(n,k),prec) * mpf(p,prec)**k * mpf(1-p,prec)**(n-k)
...
>>> print(f(1000,500,0.5))
0.0252250181783608019068416887621024545529410193921696384762532089115753731615931
>>>

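I specified a floating point precision of 256 bits. By the way, the SourceForge version is way out of date. The current version is maintained at code.google.com and supports Python 3.x. (Disclaimer: I'm the current maintainer of gmpy.)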

casevh

score 1 · Accepted Answer

I would look into the GNU Multi-Precision package (gmpy), which allows you to perform arbitrary precision calculations. You could probably do:

comb(n, k, exact=1)/2**k/2**(n-k)

but with the long integers of gmpy.

In fact, if you use exact integer computations, you can easily reach n=10000 for the combinations part; for this, you must use:

comb(n, k, exact=1)

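instead of the floating point approximation comb(n, k), which overflows.

However, as the original poster noted, the (long) integer returned may be too long to be multiplied by a float!

Furthermore, one quickly runs into another problem: 0.5**1000=9.3…e-302 is already quite close to the float underflow…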

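In summary: if you really need precise results for all k with n~10,000, you need a different approach from the formula in the original post, which suffers from the limitations of double precision floating point arithmetic. Using gmpy as indicated above could be a solution (not tested!).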
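The same idea can be tried without installing gmpy. The sketch below swaps in the standard library's fractions.Fraction and math.comb (Python 3.8 or later) for gmpy's multi-precision types, which is an assumption of this example and not what the answer above uses. The function name binomial_pmf_exact is made up for illustration.

```python
import math
from fractions import Fraction

def binomial_pmf_exact(n, k, p):
    """Exact binomial probability as a Fraction.

    p should be a Fraction or an int so that no float rounding enters the result.
    """
    p = Fraction(p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Everything stays an exact rational number until the single conversion at the end.
exact = binomial_pmf_exact(10000, 5000, Fraction(1, 2))
print(float(exact))
```

Exact rationals are slower than floats and the intermediate integers have thousands of digits, so this is mostly useful as a reference value or when only a few probabilities are needed.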

score 0 · Accepted Answer

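Not specifically a Python solution, but if you can deal with small fractional errors, you might try using Stirling's approximation for n!:

comb(n, k) = n!/(k! * (n-k)!), where n! is approximately sqrt(2*Pi*n) * (n/e)^n for large n.

For n>1000 the fractional errors should be very small.

For the probability calculation with large n, use logarithms for the intermediate results:

log p = log(comb(n, k)) - n * log(2)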

p = exp(log(p))
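A log-space calculation along these lines can be sketched with the standard library. Note that the example below uses math.lgamma for the log-factorials instead of Stirling's formula, and it generalizes the n * log(2) term to an arbitrary success probability p; both choices, as well as the function names, are assumptions made for this illustration.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient, using log-gamma for the factorials."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def binomial_pmf_log(n, k, p):
    """Binomial probability computed in log space; assumes 0 < p < 1 and 0 <= k <= n."""
    log_prob = log_comb(n, k) + k * math.log(p) + (n - k) * math.log1p(-p)
    return math.exp(log_prob)

print(binomial_pmf_log(10000, 5000, 0.5))
print(binomial_pmf_log(1000, 500, 0.5))
```

Only the final exp can underflow, and it does so only when the probability itself is too small for a double, in which case working with log_prob directly is the usual way out.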

score -1 · Accepted Answer

# The following defines the factorial function used by the binomial function below.
# n+1 is used in the range to include the nth term.

def factorial(n):
    f = 1
    for x in range(1, n+1):
        f = f*x
    return f

# The following calculates the binomial coefficient for given values of n and k.
# Integer division (//) keeps the result exact for large n.
def binomial(n, k):
    return factorial(n) // (factorial(k)*factorial(n-k))

# The following defines the Pascal triangle and prints it out for 20 rows.
# To include the nth term, n+1 needs to be in the range. Each binomial coefficient
# is appended to a row first, and then the rows are appended to the triangle.
def pascal(T):
    triangle=[]
    for n in range(T):
        r=[]
        for k in range(n+1):
            r.append(binomial(n,k))
        triangle.append(r)
    return triangle

for r in pascal(20):
    print((r))
