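I have a file with a unique ID number on each line. I'm trying to search a different file for occurrences of these ID numbers and return the lines of that second file that contain them, in this case into an output file. I'm new to programming, and this is what I have so far.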

outlist = []
with open('readID.txt', 'r') as readID, \
     open('GOlines.txt', 'w') as output, \
     open('GO.txt', 'r') as GO:  
     x = readID.readlines()
     print x
     for line in GO:
        if x[1:-1] in line:
        outlist.append(line)
        outlist.append('\n')

     if x[1:-1] in line:
        outlist.append(line)
        outlist.append('\n')
     print outlist
     output.writelines(outlist)

The files look like this:

readID.txt

00073810.1
00082422.1
00018647.1
00063072.1

GO.txt

#query  GO  reference DB    reference family    
HumanDistalGut_READ_00048904.2  GO:0006412  TIGRFAM TIGR00001    
HumanDistalGut_READ_00043244.3  GO:0022625  TIGRFAM TIGR00001    
HumanDistalGut_READ_00048644.4  GO:0000315  TIGRFAM TIGR00001   
HumanDistalGut_READ_00067264.5  GO:0003735  TIGRFAM TIGR00001

The read IDs match some, but not all, of the IDs that follow READ_.

3 Answers

#!/usr/bin/env python
# encoding: utf-8

import sys
import re

def extract_id(line):
    """
    input: HumanDistalGut_READ_00048904.2  GO:0006412  TIGRFAM TIGR00001
    returns: 00048904.2
    """
    result = re.search(r'READ_(\d{8}\.\d)', line)
    if result != None:
        return result.group(1)
    else:
        return None

def extract_go_num(line):
    """
    input: HumanDistalGut_READ_00048904.2  GO:0006412  TIGRFAM TIGR00001
    returns: 0006412
    """
    result = re.search(r'GO:(\d{7})', line)
    if result != None:
        return result.group(1)
    else:
        return None

def main(argv = None):
    if argv is None:
        argv = sys.argv

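    with open('readID.txt', 'r') as f:
        # strip the trailing newline from each ID so it can equal extract_id()'s result
        ids = frozenset(line.strip() for line in f if line.strip())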

    with open('GO.txt', 'r') as haystack, \
        open('GOLines.txt', 'w') as output:

        for line in haystack:
            if extract_id(line) in ids:
                output.write(extract_go_num(line) + '\n')

if __name__ == "__main__":
    sys.exit(main())

I'm trading memory overhead for an O(n) solution instead of an O(n^2) one.

I'm using regular expressions to extract the IDs and GO numbers, but they are brittle if the number of digits changes.
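
One possible way to relax that brittleness, as a rough sketch: match any run of digits on either side of the dot instead of fixed widths, and keep the set lookup so each GO line is checked once. This assumes an ID is always "digits.digits" directly after READ_, which the sample data suggests but does not guarantee. The names READ_ID and matching_lines are made up for this illustration, and it yields whole lines (what the question asks for) rather than GO numbers.

```python
import re

# Assumption: an ID is "digits.digits" right after "READ_"; digit counts may vary.
READ_ID = re.compile(r'READ_(\d+\.\d+)')

def extract_id(line):
    """Return the read ID embedded in a GO.txt line, or None if there is none."""
    match = READ_ID.search(line)
    return match.group(1) if match else None

def matching_lines(id_lines, go_lines):
    """Yield the lines of go_lines whose read ID appears in id_lines."""
    # A set gives constant-time membership tests; blank lines are ignored.
    ids = {s.strip() for s in id_lines if s.strip()}
    for line in go_lines:
        if extract_id(line) in ids:
            yield line
```

With open file objects this could be used as output.writelines(matching_lines(readID, GO)), since each yielded line keeps its own newline.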

answered 2013-02-18T22:11:48.213

Maybe something like this:

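with open('readID.txt', 'r') as readID, open('GOlines.txt', 'w') as output, open('GO.txt', 'r') as GO:
    # read the IDs once, stripping the trailing newline and skipping blank lines
    ids = [ID.strip() for ID in readID if ID.strip()]
    # a file can only be iterated once, so GO is the single outer loop
    for line in GO:
        if any(ID in line for ID in ids):
            output.write(line)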
answered 2013-02-18T22:12:49.043

If your files are small enough to fit in memory:

with open('/somepath/GO.txt') as f:
    pool = f.readlines()

with open('/somepath/readID.txt') as f:    
    tokens = f.readlines()

# strip spaces/new lines
tokens = [t.strip() for t in tokens]
found = [(t, lno) for t in tokens for (lno, l) in enumerate(pool) if t in l]

You can then write your found list to your output file.
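
For that last step, a minimal sketch might look like the following. It assumes found holds (token, line number) pairs as built above and that pool is the list of lines read from GO.txt. The helper names find_matches and write_found, and the tab-separated output layout, are invented for this example.

```python
def find_matches(tokens, pool):
    """Return (token, line_number) pairs for every pool line containing a token."""
    return [(t, lno) for t in tokens for lno, l in enumerate(pool) if t in l]

def write_found(found, pool, path):
    """Write each matched line to path, prefixed by the token that matched it."""
    with open(path, 'w') as out:
        for token, lno in found:
            # look the line up again by its index and normalise the newline
            out.write('%s\t%s\n' % (token, pool[lno].rstrip('\n')))
```

Calling write_found(found, pool, 'GOlines.txt') would produce one output line per match. Note that the nested comprehension compares every token against every line, so it is only practical for small files.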

answered 2013-02-18T22:21:39.530