I am trying to search a large text file (~232 GB) for some keywords. I want to take advantage of buffering for speed, and I also want to record the start position of each line that contains a keyword.
I have seen many posts here discussing similar problems. However, the solutions that use buffering (using the file object as an iterator) cannot report correct file positions, while the solutions that do give correct file positions usually just call f.readline(), which does not use buffering.
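To illustrate the mismatch, here is a minimal sketch (sample.txt is a placeholder; this is the behavior under Python 2.7, where the file iterator uses a hidden read-ahead buffer):

# Under Python 2.7 the file iterator fills a hidden read-ahead buffer,
# so f.tell() reports how far the buffer has read, not where the
# current line starts. sample.txt is a placeholder file.
with open('sample.txt', 'rb') as f:
    offset = 0
    for line in f:
        # f.tell() typically jumps ahead a whole buffer at a time,
        # while offset is the true start position of this line
        print("tell=%d, true line start=%d" % (f.tell(), offset))
        offset += len(line)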
The only answer I have seen that manages both is here:
# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
However, I am not sure whether the offset += len(line) bookkeeping wastes unnecessary time. Is there a more direct way to do this?
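For context, here is a minimal single-pass sketch of what I am ultimately after (tweets.txt and the keywords are placeholders for my actual data):

# Single pass: iterate with buffering, track each line's start offset
# manually, and record the offsets of lines containing a keyword.
# tweets.txt and KEYWORDS are placeholders.
KEYWORDS = (b'foo', b'bar')

match_offsets = []
offset = 0
with open('tweets.txt', 'rb') as f:
    for line in f:                            # buffered iteration
        if any(kw in line for kw in KEYWORDS):
            match_offsets.append(offset)      # start of this line
        offset += len(line)                   # safe in binary mode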
Update:

I have done some timing, and .readline() seems to be much slower than using the file object as an iterator, under Python 2.7.3. I used the following code:
#!/usr/bin/python
from timeit import timeit

MAX_LINES = 10000000

# use the file object as an iterator
def read_iter():
    with open('tweets.txt', 'r') as f:
        lino = 0
        for line in f:
            lino += 1
            if lino == MAX_LINES:
                break

# use .readline()
def read_readline():
    with open('tweets.txt', 'r') as f:
        lino = 0
        for line in iter(f.readline, ''):
            lino += 1
            if lino == MAX_LINES:
                break

# use offset += len(line) to simulate f.tell() under binary mode
def read_iter_tell():
    offset = 0
    with open('tweets.txt', 'rb') as f:
        lino = 0
        for line in f:
            lino += 1
            offset += len(line)
            if lino == MAX_LINES:
                break

# use f.tell() with .readline()
def read_readline_tell():
    with open('tweets.txt', 'rb') as f:
        lino = 0
        for line in iter(f.readline, ''):
            lino += 1
            offset = f.tell()
            if lino == MAX_LINES:
                break

print("iter: %f" % timeit("read_iter()", number=1, setup="from __main__ import read_iter"))
print("readline: %f" % timeit("read_readline()", number=1, setup="from __main__ import read_readline"))
print("iter_tell: %f" % timeit("read_iter_tell()", number=1, setup="from __main__ import read_iter_tell"))
print("readline_tell: %f" % timeit("read_readline_tell()", number=1, setup="from __main__ import read_readline_tell"))
The results were:
iter: 5.079951
readline: 37.333189
iter_tell: 5.775822
readline_tell: 38.629598
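So the manual offset += len(line) bookkeeping adds only about 0.7 s over 10 million lines, while anything built on .readline() is roughly 7x slower overall. It looks like iterating over the file object and tracking offsets manually is the way to go, unless there is something more direct.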