algorithm - 将一组大整数转换为一组小整数

Question

我们如何重新编码一组严格递增（或严格递减）的正整数 P，以减少我们集合中的整数之间可能出现的正整数的数量？

我们为什么要这样做：假设我们要对 P 进行随机抽样，但是 1.) P 太大而无法枚举，并且 2.) P 的成员以非随机方式相关，但以一种太复杂而无法抽样的方式经过。但是，当我们看到 P 的成员时，我们就知道它了。假设我们知道 P[0] 和 P[n]，但无法接受枚举所有 P 或准确理解 P 成员之间关系的想法。同样，出现在 P[0] 和 P[n] 之间的所有可能整数的数量比 P 的大小大很多倍，这使得随机抽取 P 的成员的机会非常小。

示例：设 P[0] = 2101010101 & P[n] = 505050505。现在，也许我们只对 P[0] 和 P[n] 之间具有特定质量的整数感兴趣（例如 P[x 中的所有整数） ] 总和为 Q 或更少，P 的每个成员的最大整数为 7 或更少）。因此，并非所有正整数 P[n] <= X <= P[0] 都属于 P。我感兴趣的 P 在下面的评论中讨论。

我已经尝试过：如果 P 是一个严格递减的集合，并且我们知道 P[0] 和 P[n]，那么我们可以将每个成员视为从 P[0] 中减去它。这样做可能会大大减少每个数字，并将每个成员保持为唯一的整数。对于我感兴趣的 P（如下），可以将 P 的每个减小值视为除以公分母 (9,11,99)，这会减少 P 成员之间可能的整数数量。我已经发现结合使用，这些方法将所有 P[0] <= X <= P[n] 的集合减少了几个数量级，从而有机会从所有正整数 P[n 中随机抽取 P 的成员] <= X <= P[0] 仍然很小。

注意：应该清楚，我们必须知道一些关于 P 的信息。如果我们不知道，那基本上意味着我们不知道我们在寻找什么。当我们随机抽样 P[0] 和 P[n] 之间的整数（是否重新编码）时，如果确实如此，我们需要能够说“是的，它属于 P。”。

一个好的答案可以大大增加我开发的计算算法的实际应用。评论 2 中给出了我感兴趣的那种 P 的例子。我坚持给予应有的信任。

score 1 · Accepted Answer

虽然最初的问题是询问有关整数编码的非常通用的场景，但我认为不太可能存在一种完全通用的方法。例如，如果 P[i] 或多或少是随机的（从信息论的角度来看），如果有什么可行的话，我会感到惊讶。

因此，相反，让我们将注意力转向 OP 的实际问题，即生成整数 N 的分区，其中正好包含 K 个部分。当将组合对象编码为整数时，我们应该尽可能多地保留组合结构。为此，我们求助于Nijenhuis 和 Wilf的经典文本 组合算法，特别是第 13 章。事实上，在本章中，他们展示了一个框架来枚举和采样许多组合族——包括 N 的分区，其中最大的部分等于K。利用众所周知的具有K个部分的分区和最大部分为K的分区之间的对偶性（取费勒斯图的转置），我们发现我们只需要对解码过程进行更改。

无论如何，这里有一些源代码：

import sys
import random
import time

if len(sys.argv) < 4 :
    sys.stderr.write("Usage: {0} N K iter\n".format(sys.argv[0]))
    sys.stderr.write("\tN = number to be partitioned\n")
    sys.stderr.write("\tK = number of parts\n")
    sys.stderr.write("\titer = number of iterations (if iter=0, enumerate all partitions)\n")
    quit()

N = int(sys.argv[1])
K = int(sys.argv[2])
iters = int(sys.argv[3])

if (N < K) :
    sys.stderr.write("Error: N<K ({0}<{1})\n".format(N,K))
    quit()

# B[n][k] = number of partitions of n with largest part equal to k
B = [[0 for j in range(K+1)] for i in range(N+1)] 

def calc_B(n,k) :
    for j in xrange(1,k+1) :
        for m in xrange(j, n+1) :
            if j == 1 :
                B[m][j] = 1
            elif m - j > 0 :
                B[m][j] = B[m-1][j-1] + B[m-j][j]
            else :
                B[m][j] = B[m-1][j-1]

def generate(n,k,r=None) :
    path = []
    append = path.append

    # Invalid input
    if n < k or n == 0 or k == 0: 
        return []

    # Pick random number between 1 and B[n][k] if r is not specified
    if r == None :
        r = random.randrange(1,B[n][k]+1)

    # Construct path from r    
    while r > 0 :
        if n==1 and k== 1:
            append('N')
            r = 0   ### Finish loop
        elif r <= B[n-k][k] and B[n-k][k] > 0  : # East/West Move
            append('E')
            n = n-k
        else : #  Northeast/Southwest move
            append('N')
            r -= B[n-k][k]
            n = n-1
            k = k-1

    # Decode path into partition    
    partition = []
    l = 0
    d = 0    
    append = partition.append    
    for i in reversed(path) :
        if i == 'N' :
            if d > 0 : # apply East moves all at once
                for j in xrange(l) :
                    partition[j] += d
            d = 0  # reset East moves
            append(1) # apply North move
            l += 1            
        else :
            d += 1 # accumulate East moves    
    if d > 0 : # apply any remaining East moves
        for j in xrange(l) :
            partition[j] += d

    return partition


t = time.clock()
sys.stderr.write("Generating B table... ")    
calc_B(N, K)
sys.stderr.write("Done ({0} seconds)\n".format(time.clock()-t))

bmax = B[N][K]
Bits = 0
sys.stderr.write("B[{0}][{1}]: {2}\t".format(N,K,bmax))
while bmax > 1 :
    bmax //= 2
    Bits += 1
sys.stderr.write("Bits: {0}\n".format(Bits))

if iters == 0 : # enumerate all partitions
    for i in xrange(1,B[N][K]+1) :
        print i,"\t",generate(N,K,i)

else : # generate random partitions
    t=time.clock()
    for i in xrange(1,iters+1) :
        Q = generate(N,K)
        print Q
        if i%1000==0 :
            sys.stderr.write("{0} written ({1:.3f} seconds)\r".format(i,time.clock()-t))

    sys.stderr.write("{0} written ({1:.3f} seconds total) ({2:.3f} iterations per second)\n".format(i, time.clock()-t, float(i)/(time.clock()-t) if time.clock()-t else 0))

以下是一些性能示例（在 MacBook Pro 8.3、2GHz i7、4 GB、Mac OSX 10.6.3、Python 2.6.1 上）：

mhum$ python part.py 20 5 10
Generating B table... Done (6.7e-05 seconds)
B[20][5]: 84    Bits: 6
[7, 6, 5, 1, 1]
[6, 6, 5, 2, 1]
[5, 5, 4, 3, 3]
[7, 4, 3, 3, 3]
[7, 5, 5, 2, 1]
[8, 6, 4, 1, 1]
[5, 4, 4, 4, 3]
[6, 5, 4, 3, 2]
[8, 6, 4, 1, 1]
[10, 4, 2, 2, 2]
10 written (0.000 seconds total) (37174.721 iterations per second)

mhum$ python part.py 20 5 1000000 > /dev/null
Generating B table... Done (5.9e-05 seconds)
B[20][5]: 84    Bits: 6
100000 written (2.013 seconds total) (49665.478 iterations per second)

mhum$ python part.py 200 25 100000 > /dev/null
Generating B table... Done (0.002296 seconds)
B[200][25]: 147151784574    Bits: 37
100000 written (8.342 seconds total) (11987.843 iterations per second)

mhum$ python part.py 3000 200 100000 > /dev/null
Generating B table... Done (0.313318 seconds)
B[3000][200]: 3297770929953648704695235165404132029244952980206369173   Bits: 181
100000 written (59.448 seconds total) (1682.135 iterations per second)

mhum$ python part.py 5000 2000 100000 > /dev/null
Generating B table... Done (4.829086 seconds)
B[5000][2000]: 496025142797537184410324290349759736884515893324969819660    Bits: 188
100000 written (255.328 seconds total) (391.653 iterations per second)

mhum$ python part-final2.py 20 3 0
Generating B table... Done (0.0 seconds)
B[20][3]: 33    Bits: 5
1   [7, 7, 6]
2   [8, 6, 6]
3   [8, 7, 5]
4   [9, 6, 5]
5   [10, 5, 5]
6   [8, 8, 4]
7   [9, 7, 4]
8   [10, 6, 4]
9   [11, 5, 4]
10  [12, 4, 4]
11  [9, 8, 3]
12  [10, 7, 3]
13  [11, 6, 3]
14  [12, 5, 3]
15  [13, 4, 3]
16  [14, 3, 3]
17  [9, 9, 2]
18  [10, 8, 2]
19  [11, 7, 2]
20  [12, 6, 2]
21  [13, 5, 2]
22  [14, 4, 2]
23  [15, 3, 2]
24  [16, 2, 2]
25  [10, 9, 1]
26  [11, 8, 1]
27  [12, 7, 1]
28  [13, 6, 1]
29  [14, 5, 1]
30  [15, 4, 1]
31  [16, 3, 1]
32  [17, 2, 1]
33  [18, 1, 1]

我将把它留给 OP 来验证此代码是否确实根据所需的（均匀）分布生成分区。

编辑：添加了枚举功能的示例。

score 0 · Accepted Answer

下面是一个完成我所要求的脚本，就重新编码表示 N 和 K 部分的整数分区的整数而言。需要一种更好的重新编码方法才能使该方法适用于 K > 4。这绝对不是最佳或首选方法。然而，它在概念上很简单，很容易被认为是基本无偏见的。小 K 也很快。脚本在 Sage notebook 中运行良好，不调用 Sage 函数。它不是随机抽样的脚本。随机抽样本身不是问题。

方法：

1.) 将整数分区视为它们的和被连接在一起并根据第一个词法分区中最大和的大小用零填充，例如 [17,1,1,1] -> 17010101 & [5,5,5,5 ] -> 05050505

2.) 将得到的整数视为从最大整数中减去它们（即表示第一个词法分区的 int）。例如 17010101 - 5050505 = 11959596

3.) 将每个结果减少的整数视为除以公分母，例如 11959596/99 = 120804

所以，如果我们想选择一个随机分区，我们会：

1.) 选择一个介于 0 和 120,804 之间的数字（而不是介于 5,050,505 和 17,010,101 之间的数字）

2.) 将数字乘以 99 并从 17010101 中减去

3.) 根据我们如何将每个整数视为用 0 填充来拆分结果整数

优点和缺点：如问题正文中所述，这种特殊的重新编码方法不足以大大提高随机选择代表 P 成员的整数的机会。对于少量部分，例如 K < 5 并且基本上更大的总数，例如 N > 100，实现此概念的函数可能非常快，因为该方法避免了及时递归（蛇吃尾巴），这会减慢其他随机分区函数或使其他函数无法处理大 N。

在小 K 时，在考虑其余过程的速度时，绘制 P 成员的概率可能是合理的。再加上快速的随机抽取、解码和评估，此函数可以为 N&K 的组合（例如 N = 20000，K = 4）找到其他算法无法实现的统一随机分区。非常需要一种更好的方法来重新编码整数，以使其成为一种普遍强大的方法。

import random
import sys

首先，一些通常有用且简单明了的功能

def first_partition(N,K):
    part = [N-K+1]
    ones = [1]*(K-1)
    part.extend(ones)
return part

def last_partition(N,K):
    most_even = [int(floor(float(N)/float(K)))]*K
    _remainder = int(N%K)
    j = 0

    while _remainder > 0:
        most_even[j] += 1
        _remainder -= 1
        j += 1
    return most_even

def first_part_nmax(N,K,Nmax):
    part = [Nmax]
    N -= Nmax
    K -= 1
    while N > 0:
        Nmax = min(Nmax,N-K+1)
        part.append(Nmax)
        N -= Nmax
        K -= 1

    return part


#print first_partition(20,4)
#print last_partition(20,4)
#print first_part_nmax(20,4,12)
#sys.exit()

def portion(alist, indices): 
    return [alist[i:j] for i, j in zip([0]+indices, indices+[None])]

def next_restricted_part(part,N,K): # *find next partition matching N&K w/out recursion

    if part == last_partition(N,K):return first_partition(N,K)
    for i in enumerate(reversed(part)):
        if i[1] - part[-1] > 1:
            if i[0] == (K-1):
                return first_part_nmax(N,K,(i[1]-1))

            else:
                parts = portion(part,[K-i[0]-1]) # split p
                h1 = parts[0]
                h2 = parts[1]
                next = first_part_nmax(sum(h2),len(h2),(h2[0]-1))
                return h1+next

""" *I don't know a math software that has this function and Nijenhuis and Wilf (1978) 
 don't give it (i.e. NEXPAR is not restricted by K). Apparently, folks often get the
 next restricted part using recursion, which is unnecessary """

def int_to_list(i): # convert an int to a list w/out padding with 0'
    return [int(x) for x in str(i)]

def int_to_list_fill(i,fill):# convert an int to a list and pad with 0's
    return [x for x in str(i).zfill(fill)]

def list_to_int(l):# convert a list to an integer
    return "".join(str(x) for x in l)

def part_to_int(part,fill):# convert an int to a partition of K parts
# and pad with the respective number of 0's

    p_list = []
    for p in part:
        if len(int_to_list(p)) != fill:
            l = int_to_list_fill(p,fill)
            p = list_to_int(l)
        p_list.append(p)
    _int = list_to_int(p_list)
    return _int       

def int_to_part(num,fill,K): # convert an int to a partition of K parts
    # and pad with the respective number of 0's
    # This function isn't called by the script, but I thought I'd include
    # it anyway because it would be used to recover the respective partition

    _list = int_to_list(num)
    if len(_list) != fill*K:
        ct = fill*K - len(_list) 
        while ct > 0:    
            _list.insert(0,0)
            ct -= 1    
    new_list1 = []
    new_list2 = []

    for i in _list:
        new_list1.append(i)
        if len(new_list1) == fill:
            new_list2.append(new_list1)
            new_list1 = []

    part = []
    for i in new_list2:
        j = int(list_to_int(i))
        part.append(j)

    return part

最后，我们得到总 N 和部分数量 K。下面将按词汇顺序打印满足 N&K 的分区，以及相关的重新编码整数

N = 20
K = 4

print '#,  partition,  coded,    _diff,    smaller_diff'

first_part = first_partition(N,K) # first lexical partition for N&K
fill = len(int_to_list(max(first_part)))
# pad with zeros to 1.) ensure a strictly decreasing relationship w/in P,
# 2.) keep track of (encode/decode) partition summand values

first_num = part_to_int(first_part,fill)
last_part = last_partition(N,K)
last_num = part_to_int(last_part,fill)

print '1',first_part,first_num,'',0,'      ',0

part = list(first_part)
ct = 1
while ct < 10:
    part = next_restricted_part(part,N,K)
    _num = part_to_int(part,fill)

    _diff = int(first_num) - int(_num)
    smaller_diff = (_diff/99)
    ct+=1

    print ct, part, _num,'',_diff,' ',smaller_diff

输出：

ct，分区，编码，_diff，smaller_diff

1 [17, 1, 1, 1] 17010101 0 0

2 [16, 2, 1, 1] 16020101 990000 10000

3 [15, 3, 1, 1] 15030101 1980000 20000

4 [15, 2, 2, 1] 15020201 1989900 20100

5 [14, 4, 1, 1] 14040101 2970000 30000

6 [14, 3, 2, 1] 14030201 2979900 30100

7 [14, 2, 2, 2] 14020202 2989899 30201

8 [13, 5, 1, 1] 13050101 3960000 40000

9 [13, 4, 2, 1] 13040201 3969900 40100

10 [13, 3, 3, 1] 13030301 3979800 40200

简而言之，最后一列中的整数可能要小得多。

为什么基于这种思想的随机抽样策略从根本上是无偏的：

N 的每个具有 K 个部分的整数分区对应于一个且仅一个重新编码的整数。也就是说，我们不会随机选择一个数字，对其进行解码，然后尝试重新排列元素以形成 N&K 的适当分区。因此，每个整数（无论是否对应于 N&K 的分区）都有相同的机会被绘制。目标是固有地减少不对应于 N 和 K 部分的分区的整数数量，从而使随机抽样的过程更快。

algorithm - 将一组大整数转换为一组小整数

2 回答 2

Related

Reference