python - Read every 4th line from 2 files simultaneously

Question

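I am working with fairly large text files (10 MB gzipped). Two files always belong together, and both have the same length and structure: each record spans 4 lines.

I need to process the data in line 2 of every 4-line block, from both files at the same time.

My question: what is the most time-efficient way to do this?

Right now I am doing this: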

import gzip
import itertools

def read_groupwise(iterable, n, fillvalue=None):
    # n references to the same iterator, so each tuple holds n consecutive lines
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)

f1 = gzip.open(file1,"r")
f2 = gzip.open(file2,"r")
for (fline1,fline2,fline3,fline4), (rline1, rline2, rline3, rline4) in zip(read_groupwise(f1, 4), read_groupwise(f2, 4)):
    # process fline2, rline2

But since I only need line 2 of each block, I suspect there is a more efficient way to do this?

score 1 · Accepted Answer

I would suggest zipping the contents of the files directly with itertools.izip_longest and using itertools.islice to pick every fourth element, starting from line 2.

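>>> import gzip
>>> from itertools import islice, izip_longest
>>> def get_nth(iterables, n, every=4, fillvalue=""):
...     # islice indices are 0-based: start at line n (1-based), then step one block at a time
...     return islice(izip_longest(*iterables, fillvalue=fillvalue), n - 1, None, every)
...
>>> with gzip.open(file1, "r") as f1, gzip.open(file2, "r") as f2:
...     for line in get_nth([f1, f2], n=2):
...         print map(str.strip, line)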
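
For readers on Python 3, where izip_longest is named zip_longest, the same idea can be sketched as follows. This is a minimal sketch rather than the answerer's code: the function name nth_of_each_block, its position and block_size parameters, and the in-memory sample lines standing in for the two open files are all assumptions made for illustration.

```python
from itertools import islice, zip_longest

def nth_of_each_block(iterables, position=2, block_size=4, fillvalue=""):
    """Return tuples holding the position-th line (1-based) of every block."""
    # Pair up line i of every input, padding shorter inputs with fillvalue.
    paired = zip_longest(*iterables, fillvalue=fillvalue)
    # islice is 0-based, so line 2 of each block starts at index 1.
    return islice(paired, position - 1, None, block_size)

# Hypothetical stand-ins for two open files: two blocks of four lines each.
lines_a = ["A1\n", "A2\n", "A3\n", "A4\n", "A5\n", "A6\n", "A7\n", "A8\n"]
lines_b = ["B1\n", "B2\n", "B3\n", "B4\n", "B5\n", "B6\n", "B7\n", "B8\n"]

pairs = [tuple(s.strip() for s in group)
         for group in nth_of_each_block([lines_a, lines_b])]
print(pairs)
```

Because islice and zip_longest are both lazy, only one line per file is held at a time when real file objects are passed in.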

score 1 · Accepted Answer

This can be done by building your own generator:

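import gzip

def get_nth(iterable, n, after=1):
    # skip ahead so that the next item read is item number `after`
    if after > 1:
        consume(iterable, after-1)
    while True:
        yield next(iterable)
        # skip the remaining n-1 items of this block
        consume(iterable, n-1)

with gzip.open(file1, "r") as f1, gzip.open(file2, "r") as f2:
    every = (4, 2)  # n=4 lines per block, starting at line 2
    # on Python 2, itertools.izip pairs the lines lazily instead of building a full list
    for line_f1, line_f2 in zip(get_nth(f1, *every), get_nth(f2, *every)):
        ...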

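The generator advances to the first item to yield (in this case we want the second item, so we skip one to place the iterator just before the second item), then yields a value, then advances to just before the next wanted item. This is a very simple way to accomplish the task at hand.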

This uses consume() from the itertools recipes:

import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
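
On Python 3.7 and later, a StopIteration that escapes from inside a generator is turned into a RuntimeError (PEP 479), so a bare `yield next(iterable)` loop no longer ends cleanly at end of file. A minimal Python 3 sketch of the same generator, assuming a for loop is an acceptable substitute for the while loop, might look like this. The in-memory lists are made-up stand-ins for the two files.

```python
from itertools import islice

def consume(iterator, n):
    """Advance the iterator n steps ahead (the n-is-given branch of the recipe)."""
    next(islice(iterator, n, n), None)

def get_nth(iterable, n, after=1):
    """Yield every n-th item, beginning with item number `after` (1-based)."""
    iterator = iter(iterable)
    consume(iterator, after - 1)    # position just before the first wanted item
    for item in iterator:           # the loop ends quietly when the input runs out
        yield item
        consume(iterator, n - 1)    # skip the rest of this block

# Hypothetical stand-ins for two open files: two blocks of four lines each.
lines_a = ["A%d" % i for i in range(1, 9)]
lines_b = ["B%d" % i for i in range(1, 9)]

picked = list(zip(get_nth(lines_a, 4, 2), get_nth(lines_b, 4, 2)))
print(picked)
```

On Python 3, zip is already lazy, so the pairs are produced one at a time when the loop reads from real files.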

As a final note, I am not sure whether gzip.open() provides a context manager; if it does not, you will want to use contextlib.closing().
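
As far as I know, the object returned by gzip.open() has supported the with statement since Python 2.7, so the wrapper is rarely needed. For an object that only offers close(), the fallback could be sketched like this. It assumes Python 3 (text modes "wt" and "rt"); the temporary file and its contents are made up for the example.

```python
import contextlib
import gzip
import os
import tempfile

# Write a small gzip file to read back (sample data, invented for this sketch).
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wt") as out:
    out.write("line1\nline2\nline3\nline4\n")

# closing() calls .close() on the wrapped object when the block exits,
# even if the object itself does not implement __enter__/__exit__.
with contextlib.closing(gzip.open(path, "rt")) as f:
    lines = [line.strip() for line in f]

print(lines)
```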

score 0 · Accepted Answer

If you have enough memory, try:

ln1 = f1.readlines()[1::4]  # index 1 is the second line of each 4-line block
ln2 = f2.readlines()[1::4]
for fline, rline in zip(ln1, ln2):
    ...

But only if you have enough memory.
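
Slice indices are zero-based, so the second line of each four-line block sits at index 1, and a start of 2 would select the third line instead. A quick check on a made-up list of lines:

```python
# Hypothetical stand-in for the result of readlines(): two blocks of four lines.
lines = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"]

second_lines = lines[1::4]  # start at index 1, then take every 4th element
third_lines = lines[2::4]   # a start of 2 picks the third line of each block

print(second_lines)
print(third_lines)
```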
