3

I have two files as shown below:

File 1 (tab delimited):

A1   someinfo1     someinfo2    someinfo3
A1   someinfo1     someinfo2    someinfo3
B1   someinfo1     someinfo2    someinfo3
B1   someinfo1     someinfo2    someinfo3

File 2 (tab delimited):

A1   newinfo1     newinfo2    newinfo3
A1   newinfo1     newinfo2    newinfo3
B1   newinfo1     newinfo2    newinfo3
B1   newinfo1     newinfo2    newinfo3

I want to read two lines together (both starting with A1) from File 1 and the corresponding two lines (both starting with A1) from File 2. To be clear, I have two requirements:

1) Read two consecutive lines from the same file.
2) Read the same two lines from the other file.

To be precise, I want to read four lines together: two consecutive lines from each of the two files.

I searched online and found code that reads two lines together, but only from one file:

import itertools
with open(File1) as file1:
    for line1, line2 in itertools.izip_longest(*[file1]*2):
        print line1, line2

I was also able to read one line from each of the two files:

for i,(line1,line2) in enumerate(itertools.izip(f1,f2)):
        print line1, line2

But I want to do something like:

Pseudocode:

for line1, line2 from file1 and line_1 and line_2 from file2:
              compare line1 with line2
              compare line1 with line_1
              compare line2 with line_1
              compare line2 with line_2

I am hoping for a linear-time solution. All the files have the same number of lines; the first column (the primary id) is the same for consecutive lines within a file, and the other file follows the same order (see the example above).

Thanks.

4 Answers

6

How about this:

with open('a') as A, open('b') as B:
    while True:
        try:
            lineA1, lineA2, lineB1, lineB2 = next(A), next(A), next(B), next(B)
            # compare lines
            # ...
        except StopIteration:
            break
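
The same idea as a self-contained sketch, written for Python 3 (an assumption; the rest of this thread uses Python 2). `read_quads` is a hypothetical helper name, and the `io.StringIO` objects are stand-ins for the two real files:

```python
import io

def read_quads(file_a, file_b):
    """Yield (a1, a2, b1, b2): two consecutive lines from each file."""
    while True:
        try:
            # The tuple is built left to right, so each file advances by two lines.
            quad = next(file_a), next(file_a), next(file_b), next(file_b)
        except StopIteration:
            # One of the files ran out of lines; stop cleanly.
            return
        yield quad

# Hypothetical sample data in place of the tab-delimited files.
file_a = io.StringIO("A1\tx\nA1\ty\nB1\tx\nB1\ty\n")
file_b = io.StringIO("A1\tp\nA1\tq\nB1\tp\nB1\tq\n")

quads = list(read_quads(file_a, file_b))
```

Each file is read once from start to end, so the work is linear in the number of lines. If a file has an odd number of lines, the trailing line is silently dropped.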
answered 2013-01-16T23:39:05.580
1
>>> from itertools import izip
>>> with open("file1") as file1, open("file2") as file2:
...     for a1, a2, b1, b2 in izip(file1, file1, file2, file2):
...         print a1, a2, b1, b2
... 
A1   someinfo1     someinfo2    someinfo3
A1   someinfo1     someinfo2    someinfo3
A1   newinfo1     newinfo2    newinfo3
A1   newinfo1     newinfo2    newinfo3

B1   someinfo1     someinfo2    someinfo3
B1   someinfo1     someinfo2    someinfo3
B1   newinfo1     newinfo2    newinfo3
B1   newinfo1     newinfo2    newinfo3

You can make the number of lines a parameter (n) like this:

for lines in izip(*[file1]*n+[file2]*n):

Now lines will be a tuple with n*2 elements.
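
A minimal sketch of the parameterized form, assuming Python 3 (where the built-in `zip` is lazy and plays the role of `izip`). `read_groups` is a hypothetical name, and `io.StringIO` stands in for the real files:

```python
import io

def read_groups(file1, file2, n):
    """Return an iterator of tuples: n lines from file1, then n lines from file2."""
    # Repeating the same file object n times makes zip pull n consecutive
    # lines from it for every tuple it builds.
    return zip(*([file1] * n + [file2] * n))

# Hypothetical sample data.
file1 = io.StringIO("A1\ta\nA1\tb\nB1\ta\nB1\tb\n")
file2 = io.StringIO("A1\tc\nA1\td\nB1\tc\nB1\td\n")

groups = list(read_groups(file1, file2, 2))
```

With n = 2 each tuple holds four lines. Iteration stops as soon as either file is exhausted, so an incomplete trailing group is discarded rather than padded.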

answered 2013-01-17T00:08:16.377
1

Let's see how to put these together. First:

with open(File1) as file1:
    for line1,line2 in itertools.izip_longest(*[file1]*2):

Well, take away the for loop, and you've got a 2-lines-at-a-time iterator over the file, right? So you can do the same thing for file2. Then you can zip them together:

with open(File1) as file1, open(File2) as file2:
    f1 = itertools.izip_longest(*[file1]*2)
    f2 = itertools.izip_longest(*[file2]*2)
    for i,((f1_line1, f1_line2), (f2_line1, f2_line2)) in enumerate(itertools.izip(f1,f2)):
        # do stuff

But you really don't want to do that.

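First, most people don't intuitively read izip_longest(*[file1]*2) and realize that it groups into pairs. Wrap it up in a function. In fact, don't even write the function yourself; take grouper from the itertools docs.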
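
For reference, here is a sketch of that recipe adapted to Python 3 (`zip_longest` instead of `izip_longest`). The argument order, with n first, follows the calls in this answer; as far as I recall, newer versions of the itertools docs put the iterable first, so check the version you copy from:

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    """Collect items into fixed-length groups, padding the last one with fillvalue."""
    # n references to the same iterator: each output tuple consumes n items.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Five items grouped in pairs; the last pair is padded with None.
pairs = list(grouper(2, ["l1", "l2", "l3", "l4", "l5"]))
```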

So now, it's:

with open(File1) as file1, open(File2) as file2:
    pairs1 = grouper(2, file1)
    pairs2 = grouper(2, file2)
    for i, ((f1_line1, f1_line2), (f2_line1, f2_line2)) in enumerate(itertools.izip(pairs1, pairs2)):
        # do stuff

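Next, pattern matching can be cool, but a nested pattern unpacked in the middle of a complex expression is a bit much. So let's break that up, and un-nest it by again borrowing flatten from the itertools docs: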

with open(File1) as file1, open(File2) as file2:
    pairs1 = grouper(2, file1)
    pairs2 = grouper(2, file2)
    zipped_pairs = itertools.izip(pairs1, pairs2)
    for i, zipped_pair in enumerate(zipped_pairs):
        f1_line1, f1_line2, f2_line1, f2_line2 = flatten(zipped_pair)
        # do stuff
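
Put together as something runnable, this might look like the sketch below. It assumes Python 3 (`zip` and `zip_longest` in place of `izip` and `izip_longest`), defines `grouper` and `flatten` along the lines of the itertools recipes, and uses `io.StringIO` objects as hypothetical stand-ins for the two files:

```python
import io
from itertools import chain, zip_longest

def grouper(n, iterable, fillvalue=None):
    """Group items n at a time (argument order as used in this answer)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def flatten(list_of_lists):
    """Flatten one level of nesting."""
    return chain.from_iterable(list_of_lists)

# Hypothetical sample data.
file1 = io.StringIO("A1\ta\nA1\tb\nB1\ta\nB1\tb\n")
file2 = io.StringIO("A1\tc\nA1\td\nB1\tc\nB1\td\n")

rows = []
zipped_pairs = zip(grouper(2, file1), grouper(2, file2))
for i, zipped_pair in enumerate(zipped_pairs):
    # ((l1, l2), (l3, l4)) becomes l1, l2, l3, l4
    f1_line1, f1_line2, f2_line1, f2_line2 = flatten(zipped_pair)
    rows.append((i, f1_line1, f1_line2, f2_line1, f2_line2))
```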

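The advantage of this solution is that it's abstract and general, which means that if you later decide you need groups of 5 lines, or 3 files, the change is obvious.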

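The disadvantage of this solution is that it's abstract and general, which means it can't possibly be as simple as the concrete equivalent. (For example, if you didn't zip up a pair of groupers, you wouldn't have to flatten the result.)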

answered 2013-01-16T23:39:48.753
0

Here's a generalization that allows any number of consecutive lines with the same id column:

from itertools import groupby, izip, product

getid = lambda line: line.partition("\t")[0] # first tab-separated column (the files are tab delimited)
same_id = lambda lines: groupby(lines, key=getid)

with open(File1) as file1, open(File2) as file2:
     for (id1, lines1), (id2, lines2) in izip(same_id(file1), same_id(file2)):
         if id1 != id2: 
            # handle error here
            break
         # compare all possible combinations
         for a, b in product(lines1, lines2): 
             compare(a, b)
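
A runnable sketch of this approach, assuming Python 3 and tab-delimited input as described in the question. `compare_files` is a hypothetical wrapper that collects the compared pairs in a list instead of calling a real `compare` function, and it raises `ValueError` on an id mismatch as one possible way to handle the error:

```python
import io
from itertools import groupby, product

def get_id(line):
    """Return the first tab-separated column of a line."""
    return line.partition("\t")[0]

def same_id(lines):
    """Group consecutive lines that share the same id."""
    return groupby(lines, key=get_id)

def compare_files(file1, file2):
    compared = []
    for (id1, lines1), (id2, lines2) in zip(same_id(file1), same_id(file2)):
        if id1 != id2:
            raise ValueError("id mismatch: %r != %r" % (id1, id2))
        # product() reads both groups fully before yielding any pair,
        # so the groups are consumed before groupby moves on.
        for a, b in product(lines1, lines2):
            compared.append((a.rstrip("\n"), b.rstrip("\n")))
    return compared

# Hypothetical sample data with uneven group sizes.
file1 = io.StringIO("A1\ta\nA1\tb\nB1\ta\n")
file2 = io.StringIO("A1\tc\nB1\tc\nB1\td\n")

result = compare_files(file1, file2)
```

Note that `product` only pairs lines across the two files. Comparing lines within the same file, as the question's pseudocode does, would need an extra step such as `itertools.combinations` over a list copy of each group.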
answered 2013-01-17T00:08:25.170