简单的解决方案是按键列对文件进行排序,例如,按第二列对制表符分隔的输入进行排序:
#!/bin/bash
printf "a\tz\nb\ty\nc\tx" | sort -k 2 -t $'\t'
然后解决一个更简单的问题,即为每个唯一键检索 25% 的随机行,其中具有相同键的所有行都相邻,并且每个唯一键至少应保留一行:
#!/usr/bin/env python
import random
import sys
from itertools import chain, groupby
def choose_random(iterator, fraction, random=random.random):
"""Lazy analog of:
L = list(iterator)
k = int(len(L) * fraction + .5) or 1 # get at least one
result = random.sample(L, k)
Note: this function doesn't randomize the order of elements
that would require to keep selected elements in memory
and number of output elements is not exactly k
"""
# always yield at least one item if input is not empty
item = next(iterator)
it = (x for x in chain([item], iterator) if random() < fraction)
for x in chain([next(it, item)], it):
yield x
def getkey(line):
return line.split("\t")[1] # 2nd column
for key, group in groupby(sys.stdin, key=getkey):
sys.stdout.writelines(choose_random(group, fraction=0.25))
注意:输入文件的最后一行应该包含一个换行符,否则如果选择了最后一行,则输出会损坏。
该脚本接受标准输入上的排序(按键列)输入,并将减少的输出打印到标准输出。它一次只需要在内存中存储一行。这是一个单遍算法(O(n))。