Given the added requirement that the lines selected from the file be statistically uniformly distributed, I offer this straightforward approach.
"""randsamp - extract a random subset of n lines from a large file"""
import random
def scan_linepos(path):
"""return a list of seek offsets of the beginning of each line"""
linepos = []
offset = 0
with open(path) as inf:
# WARNING: CPython 2.7 file.tell() is not accurate on file.next()
for line in inf:
linepos.append(offset)
offset += len(line)
return linepos
def sample_lines(path, linepos, nsamp):
"""return nsamp lines from path where line offsets are in linepos"""
offsets = random.sample(linepos, nsamp)
offsets.sort() # this may make file reads more efficient
lines = []
with open(path) as inf:
for offset in offsets:
inf.seek(offset)
lines.append(inf.readline())
return lines
dataset = 'big_data.txt'
nsamp = 5
linepos = scan_linepos(dataset) # the scan only need be done once
lines = sample_lines(dataset, linepos, nsamp)
print 'selecting %d lines from a file of %d' % (nsamp, len(linepos))
print ''.join(lines)
I tested it on a mock data file of 3 million lines occupying 1.7GB on disk. scan_linepos dominated the runtime, taking about 20 seconds on my not-so-hot desktop.
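Since the offsets never change for a static file, that expensive scan can also be cached between runs rather than repeated. A minimal sketch of that idea, assuming the data file is not modified; load_linepos and the big_data.idx cache filename are names invented here for illustration, and scan_linepos is the function defined above:

import os
import pickle

def load_linepos(path, idxpath):
    """return cached line offsets, rescanning only if the cache is stale"""
    # reuse the cache when it exists and is at least as new as the data file
    if os.path.exists(idxpath) and os.path.getmtime(idxpath) >= os.path.getmtime(path):
        with open(idxpath, 'rb') as f:
            return pickle.load(f)
    linepos = scan_linepos(path)  # the expensive full pass over the file
    with open(idxpath, 'wb') as f:
        pickle.dump(linepos, f)
    return linepos

linepos = load_linepos('big_data.txt', 'big_data.idx')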
Just to check the performance of sample_lines itself, I used the timeit module:
import timeit

t = timeit.Timer('sample_lines(dataset, linepos, nsamp)',
        'from __main__ import sample_lines, dataset, linepos, nsamp')
trials = 10 ** 4
elapsed = t.timeit(number=trials)
print u'%dk trials in %.2f seconds, %.2fµs per trial' % (trials/1000,
        elapsed, (elapsed/trials) * (10 ** 6))
I ran this for various values of nsamp: when nsamp was 100, a single sample_lines call completed in 460µs, and it scaled linearly up to 10k samples at 47ms per call.
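To reproduce that scaling measurement, the same harness can be run in a loop over sample sizes. A rough sketch; the specific nsamp values and trial count here are arbitrary choices, not those of the original run:

import timeit

for nsamp in (10, 100, 1000, 10000):
    t = timeit.Timer('sample_lines(dataset, linepos, %d)' % nsamp,
            'from __main__ import sample_lines, dataset, linepos')
    trials = 1000  # fewer trials than above so the larger samples finish quickly
    elapsed = t.timeit(number=trials)
    print u'nsamp=%5d: %8.2fµs per call' % (nsamp, (elapsed/trials) * (10 ** 6))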
The natural next question is "Random is barely random at all?", and the answer is "sub-cryptographic, but certainly fine for bioinformatics".
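If stronger randomness were ever needed, the standard library's random.SystemRandom draws from the OS entropy pool and exposes the same sample method, so it drops straight into sample_lines. A sketch of that variant; sample_lines_strong is a made-up name, not part of the original:

import random

def sample_lines_strong(path, linepos, nsamp):
    """sample_lines variant drawing from the OS entropy source"""
    offsets = random.SystemRandom().sample(linepos, nsamp)  # os.urandom-backed
    offsets.sort()  # this may make file reads more efficient
    lines = []
    with open(path) as inf:
        for offset in offsets:
            inf.seek(offset)
            lines.append(inf.readline())
    return lines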