python - Generating a large random sequence of unique numbers

Question

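I need to fill a file with a large number of records identified by numbers (test data). The number of records is very large, the IDs must be unique, and the order of the records should be random (or pseudo-random).

I tried this: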

# coding: utf-8
import random

COUNT = 100000000

random.seed(0)
file_1 = open('file1', 'w')
for i in random.sample(xrange(COUNT), COUNT):
    file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()

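But it eats up all of my memory.

Is there a way to generate a large shuffled sequence of consecutive integers (consecutive is not required, but would be nice; otherwise just unique)? Ideally using a generator, without holding the whole sequence in RAM?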

score 9 · Accepted Answer

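If you have 100 million numbers, as in the question, then this is actually manageable in memory (it takes about 0.5 GB).

As DSM pointed out, this can be done efficiently with the standard modules: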

>>> import array
>>> a = array.array('I', xrange(10**8))  # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random                                                               
>>> random.shuffle(a)

You can also use the third-party NumPy package, which is the standard Python tool for managing arrays efficiently:

>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32')  # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)

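(This is only useful if your program already uses NumPy, since the standard-module approach is about as efficient.)

Both methods take about the same time on my machine (maybe 1 minute for the shuffle), and the 0.5 GB they use is not too large for current computers.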
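As a minimal sketch of how the standard-module approach could feed the file from the question, assuming Python 3 (`range` instead of `xrange`). The function name `write_shuffled_ids` and its parameters are hypothetical, introduced only for illustration:

```python
import array
import random


def write_shuffled_ids(path, count, seed=0):
    """Write `count` records with unique IDs 0..count-1 in shuffled order."""
    # 'I' is an unsigned int, typically 4 bytes per element,
    # so 10**8 IDs need roughly 0.4 to 0.5 GB.
    ids = array.array('I', range(count))
    # A seeded Random instance keeps the test data reproducible.
    random.Random(seed).shuffle(ids)
    with open(path, 'w') as out:
        for i in ids:
            out.write('ID{0},A{0}\n'.format(i))
```

Calling `write_shuffled_ids('file1', 100000000)` would correspond to the question's setup; trying it with a small `count` first is probably wise.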

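PS: There are too many elements for the shuffle to be truly random, because the number of possible permutations is far larger than the period of the random generator used. In other words, Python can produce fewer shuffles than the number of possible shuffles!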
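To put a rough number on that remark: CPython's `random` module uses the Mersenne Twister, whose period is 2**19937 - 1. The sketch below (helper names are hypothetical) compares log2(n!) with that period to find the longest sequence for which every permutation could in principle be reached:

```python
import math

MT_PERIOD_BITS = 19937  # Mersenne Twister period is 2**19937 - 1


def log2_factorial(n):
    """Return log2(n!) using lgamma, since lgamma(n + 1) == ln(n!)."""
    return math.lgamma(n + 1) / math.log(2)


def largest_fully_shufflable_length(period_bits=MT_PERIOD_BITS):
    """Largest n such that n! still fits within 2**period_bits."""
    n = 1
    while log2_factorial(n + 1) <= period_bits:
        n += 1
    return n
```

For the Mersenne Twister this comes out at 2080, the figure quoted in the `random` module documentation. For 10**8 elements, log2(n!) is about 2.5 billion bits, so only a vanishing fraction of the permutations is reachable. For test data this is unlikely to matter.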

score 4 · Accepted Answer

Maybe something like this (the numbers won't be consecutive, but they will be unique):

from uuid import uuid4

def unique_nums():  # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative yield uuid4().int

unique_num = unique_nums()
next(unique_num)
next(unique_num) # etc...
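How "practically unique" this is can be estimated with the birthday bound. The sketch below assumes the 122 random bits of a version-4 UUID (128 bits minus 6 fixed version and variant bits); the function name is hypothetical:

```python
import math

UUID4_RANDOM_BITS = 122  # 128 bits minus 6 fixed version/variant bits


def collision_probability(count, bits=UUID4_RANDOM_BITS):
    """Birthday-bound approximation of the chance of any duplicate.

    Uses p ~= 1 - exp(-count * (count - 1) / (2 * 2**bits)).
    """
    exponent = count * (count - 1) / (2 * 2 ** bits)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-exponent)
```

For the 100 million records in the question, `collision_probability(10**8)` is on the order of 1e-21, so a duplicate is not a realistic concern. The price is that the IDs are 128-bit numbers rather than small consecutive integers.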

score 0 · Accepted Answer

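This will keep your memory usage in check, but will probably kill your disk :)

It generates a file containing the sequence of numbers from 0 up to 100000000, then repeatedly picks a random position in it and writes the value found there to another file. The numbers in the first file have to be rearranged to "delete" the ones that have already been chosen.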

import random

COUNT = 100000000

# Feed the file with COUNT fixed-width (8-digit) numbers: 0 .. COUNT-1.
# COUNT itself has 9 digits and would break the 8-byte record layout.
with open('file1','w') as f:
    i = 0
    while i < COUNT:
        f.write("{0:08d}".format(i))
        i += 1

with open('file1','r+') as f1:
    i = COUNT - 1  # index of the last value that has not been chosen yet
    with open('file2','w') as f2:
        while i >= 0:
            f1.seek(i*8)
            # Read the last val
            last_val = f1.read(8)
            random_pos = random.randint(0, i)
            # Read random pos
            f1.seek(random_pos*8)
            random_val = f1.read(8)
            f2.write('ID{0},A{0}\n'.format(random_val))
            # Write the last value to this position
            f1.seek(random_pos*8)
            f1.write(last_val)
            i -= 1
print("Done")
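The selection logic is easier to see on an in-memory list. This is a minimal sketch of the same "pick a random slot, then overwrite it with the last remaining value" step, not the disk-based version itself; the name `shuffled` and the `rng` parameter are hypothetical:

```python
import random


def shuffled(values, rng=random):
    """Yield every value exactly once, in random order."""
    pool = list(values)
    last = len(pool) - 1  # index of the last value not yet chosen
    while last >= 0:
        pos = rng.randint(0, last)  # inclusive on both ends
        chosen = pool[pos]
        # "Delete" the chosen value by moving the last one into its slot.
        pool[pos] = pool[last]
        last -= 1
        yield chosen
```

The file-based code does the same thing, with `f1.seek(pos*8)` standing in for list indexing, which is why it trades memory for a very large number of random disk accesses.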

score 0 · Accepted Answer

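You can easily get random integers by reading /dev/urandom (on Linux) or by using os.urandom() and struct.unpack():

Return a string of n random bytes suitable for cryptographic use.

This function returns random bytes from an OS-specific randomness source. The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom. If a randomness source is not found, NotImplementedError will be raised.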

>>> import os, struct
>>> for i in range(4): print( hex( struct.unpack('<L', os.urandom(4))[0]))
... 
0xbd7b6def
0xd3ecf2e6
0xf570b955
0xe30babb6
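When many integers are needed, the bytes can be requested and unpacked in one call. A minimal sketch, assuming Python 3; the helper name `random_uint32s` is hypothetical:

```python
import os
import struct


def random_uint32s(n):
    """Return n random unsigned 32-bit integers from the OS randomness source."""
    raw = os.urandom(4 * n)
    # '<' selects little-endian with standard sizes, so 'L' is exactly 4 bytes.
    return list(struct.unpack('<{0}L'.format(n), raw))
```

Note that nothing here prevents the same value from appearing twice, so on its own this does not give unique IDs.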

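The random module, on the other hand:

However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.

If you really need unique records, you should go with this or the answer provided by EOL.

But assuming a truly random source, where values may repeat, you have a 1/N chance (where N = 2 ** (sizeof(int)*8) = 2 ** 32) of hitting a given item on the first guess, so you can get (2**32) ** length possible outputs.

On the other hand, when using only unique results, you will have at most:

product from i = 0 to length - 1 of (2**32 - i)
               = n! / (n - length)!
               = (2**32)! / (2**32 - length)!

where n = 2**32 and ! is the factorial, not logical negation. So you would only decrease the randomness of the result.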
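The two counts can be compared exactly with Python's arbitrary-precision integers. This is only an illustrative sketch with hypothetical function names:

```python
import math


def count_with_repeats(n, length):
    """Number of sequences of `length` values drawn from n, repeats allowed."""
    return n ** length


def count_unique(n, length):
    """Number of sequences of `length` distinct values: n! / (n - length)!."""
    result = 1
    for i in range(length):
        result *= n - i
    return result


def lost_bits(n, length):
    """How many bits of entropy are lost by insisting on unique values."""
    return math.log2(count_with_repeats(n, length) / count_unique(n, length))
```

For n = 2**32 the unique count is always the smaller of the two, although for short sequences the difference is tiny: `lost_bits(2**32, 1000)` is well below one thousandth of a bit.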
