
I have a file with annotations in the following format:

  XS-5236245.2_hypothetical_protein

and a tab-delimited BLAST report that has only the accession ID in its second column:

  transcript1  XS-5236245.2  94.3  35  0  245  356  789  896  1e-230 6.3

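When there is a match, I would like to replace the accession ID in the BLAST report with the whole line from the annotation file. Here is my attempt; as you can see, I use very basic Python. I would appreciate it if you could give me a more sophisticated solution. Thanks for your help.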

林努

#!/usr/bin/python
import sys

# sys.argv[1] --> file with annotations
# sys.argv[2] --> file with BLAST report
# sys.argv[3] --> modified BLAST report with annotations

f1 = open(sys.argv[1], "r")
f2 = open(sys.argv[2], "r")
f3 = open(sys.argv[3], "w")

# read the annotation file line by line
for line in f1:
    line = line.strip()
    # split the line on '_'; the first element is the search ID
    searchid = line.split("_")[0]
    # rewind the BLAST report so it is scanned again for every annotation
    f2.seek(0)
    for row in f2:
        # split columns on the tab separator
        col = row.rstrip("\n").split("\t")
        # the target ID is the content of the second column
        if len(col) > 1 and col[1] == searchid:
            # replace the accession ID with the whole annotation line
            col[1] = line
            f3.write("\t".join(col) + "\n")

f1.close()
f2.close()
f3.close()

1 Answer


I found a solution along these lines:

  1. Create a new file (accession_headers.txt) with two columns: the accession ID in the first column and the full header in the second. This is easy with Python:

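    #!/usr/bin/env python
    import sys

    # sys.argv[1]: annotation file; sys.argv[2]: two-column output file
    with open(sys.argv[1], 'r') as f1, open(sys.argv[2], 'w') as f2:
        for line in f1:
            line = line.strip()
            if not line:
                continue
            # the accession ID is the text before the first '_'
            accessionid = line.split('_')[0]
            f2.write('{0} {1}\n'.format(accessionid, line))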
    
  2. The BLAST report has the following format:

    c16_g1_i1 len=581 path=[12725:0-580]    XS-5236245.2    94.9    59  3   0   403 579 254 312 8.6e-27 116.3
    
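  3. I applied the following awk one-liner, which I adapted to fit my files. Apologies, but I am not enough of an awk expert to explain each step fully; perhaps someone can contribute an explanation of what each step does: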

    awk 'NR==FNR{a[$1]=$2;next}$4 in a{$4=a[$4]}1' accession_headers.txt blast.report > outfile 
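
For anyone wondering what the awk one-liner does, a rough Python equivalent may help. `NR==FNR` holds while awk is reading the first file, so `a[$1]=$2` builds a lookup table from accession ID to full header, and `next` moves on to the next line. For the second file, `$4 in a{$4=a[$4]}` replaces the fourth whitespace-separated field when it is a known accession ID, and the trailing `1` prints every line. The accession ID is field 4 rather than field 2 because the query name in the sample report contains spaces. The function name `annotate_report` and the in-memory lists are assumptions made for illustration; this is a sketch, not the original method, and awk's handling of empty or very short lines is not reproduced.

```python
def annotate_report(header_lines, report_lines):
    """Roughly mimic: awk 'NR==FNR{a[$1]=$2;next}$4 in a{$4=a[$4]}1'."""
    # First file: a[$1] = $2 (accession ID -> full header)
    lookup = {}
    for line in header_lines:
        fields = line.split()
        if len(fields) >= 2:
            lookup[fields[0]] = fields[1]

    # Second file: swap field 4 when it is a known accession ID
    out = []
    for line in report_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[3] in lookup:
            fields[3] = lookup[fields[3]]
            # awk rebuilds a modified line with single spaces (default OFS)
            out.append(" ".join(fields))
        else:
            # unmatched lines are printed unchanged
            out.append(line.rstrip("\n"))
    return out


headers = ["XS-5236245.2 XS-5236245.2_hypothetical_protein\n"]
report = ["c16_g1_i1 len=581 path=[12725:0-580]\tXS-5236245.2\t94.9\t59\n"]
print(annotate_report(headers, report)[0])
```

Note that matched lines come out space-separated instead of tab-separated, just as with the awk command; if the tabs matter downstream, splitting and joining on `"\t"` (or setting `FS` and `OFS` in awk) would be the thing to try.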
    
answered 2015-02-10T10:36:06.313