python - Error when using choice from the random module in Python

Question

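I am trying to build a randomized dataset based on an input dataset. The input dataset consists of 856471 lines, each holding a pair of tab-separated values. No entry in the randomized dataset may equal any entry in the input dataset, meaning:

If the pair on line 1 is "Protein1 Protein2", the randomized dataset must not contain either of these pairs:

"Protein1 Protein2"
"Protein2 Protein1"

To achieve this, I tried the following: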

data = infile.readlines()
ltotal = len(data)
for line in data:
    words = string.split(line)

init = 0
while init != ltotal:
    p1 = random.choice(words)
    p2 = random.choice(words)
    words.remove(p1)
    words.remove(p2)
    if "%s\t%s\n" % (p1, p2) not in data and "%s\t%s\n" % (p2, p1) not in data:
        outfile.write("%s\t%s\n" % (p1, p2))

However, I get the following error:

Traceback (most recent call last):
  File "C:\Users\eduarte\Desktop\negcreator.py", line 46, in <module>
    convert(indir, outdir)
  File "C:\Users\eduarte\Desktop\negcreator.py", line 27, in convert
    p1 = random.choice(words)
  File "C:\Python27\lib\random.py", line 274, in choice
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
IndexError: list index out of range

I was pretty sure this would work. What am I doing wrong? Thanks in advance.

score 1 · Accepted Answer

The variable words is overwritten for each line in the loop:

for line in data:
    words = string.split(line)

This is most likely not what you want.

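Furthermore, your while loop is an infinite loop: init never changes, so the condition init != ltotal stays true, and words is eventually used up, leaving nothing for random.choice() to pick from.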
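
Both problems can be reproduced in isolation. The following is a minimal sketch, assuming a tiny in-memory list in place of the input file (the sample lines are made up for illustration). It uses line.split() rather than string.split(line) so that it also runs on Python 3, and it removes p1 before choosing p2 so that the run is deterministic:

```python
import random

# Hypothetical sample input standing in for infile.readlines().
data = ["A\tB\n", "C\tD\n", "E\tF\n"]

# Each iteration rebinds `words`, so only the last line's words survive.
for line in data:
    words = line.split()
print(words)  # ['E', 'F']

# Removing two words per pass empties the list after a single pass ...
p1 = random.choice(words)
words.remove(p1)
p2 = random.choice(words)
words.remove(p2)
print(words)  # []

# ... and the next random.choice() call then fails on the empty list.
raised = False
try:
    random.choice(words)
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```

With 856471 input lines the loop condition can never be satisfied after one pass, so the second pass is where the traceback in the question comes from.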

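Edit: My guess is that you have a file of tab-separated word pairs, one pair per line, and that you are trying to form random pairs from all the words, writing to the output file only those random pairs that do not occur in the original file. Here is some code that does this: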

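import itertools
import random

# Read every input pair as a frozenset so that word order is ignored.
with open("infile") as infile:
    pairs = set(frozenset(line.split()) for line in infile)
# Flatten the pairs into one list of words and shuffle it once.
words = list(itertools.chain.from_iterable(pairs))
random.shuffle(words)
with open("outfile", "w") as outfile:
    # Walk the shuffled words two at a time (Python 2: itertools.izip).
    for pair in itertools.izip(*[iter(words)] * 2):
        # Keep only pairs that do not occur in the input in either order.
        if frozenset(pair) not in pairs:
            outfile.write("%s\t%s\n" % pair)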

Notes:

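A pair of words is represented by a frozenset, since the order does not matter.
I use a set for all pairs so that testing whether a pair is in the set takes constant time (on average).
Instead of calling random.choice() repeatedly, I shuffle the whole list only once and then iterate over it in pairs. This way, we do not need to remove already used words from the list, so it is much more efficient. (This change and the previous one reduce the algorithmic complexity of the approach from O(n²) to O(n).)
The expression itertools.izip(*[iter(words)] * 2) is a common Python idiom for iterating over words in pairs, in case you have not encountered it yet.
The code is still untested.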
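
To make the notes above concrete, here is a small sketch of the same approach written as a function that works on in-memory lines instead of files. The function name random_new_pairs, its parameters, and the sample pairs are hypothetical and chosen only for illustration. It uses the built-in zip, which on Python 3 is lazy in the same way itertools.izip is on Python 2:

```python
import itertools
import random


def random_new_pairs(lines, rng=random):
    """Return random word pairs that do not occur, in either order, in lines."""
    # frozenset ignores order, so the entry for "A\tB" also covers "B\tA".
    known = set(frozenset(line.split()) for line in lines)
    words = list(itertools.chain.from_iterable(known))
    rng.shuffle(words)
    result = []
    # zip(*[iter(words)] * 2) pulls two items at a time from one iterator.
    for pair in zip(*[iter(words)] * 2):
        if frozenset(pair) not in known:
            result.append(pair)
    return result


# The pairwise idiom on its own:
print(list(zip(*[iter("abcdef")] * 2)))  # [('a', 'b'), ('c', 'd'), ('e', 'f')]

# Hypothetical sample input; the pairs printed vary from run to run.
sample = ["P1\tP2\n", "P3\tP4\n", "P5\tP6\n"]
for p1, p2 in random_new_pairs(sample, random.Random(0)):
    print("%s\t%s" % (p1, p2))
```

A word that occurs in several input pairs appears several times in words, so a generated pair can repeat or contain the same word twice. Whether that needs extra filtering depends on the data.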
