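I wrote the following script to retrieve the gene counts for each contig. It works fine, but the order of the ID list I use as input is not preserved in the output. I need the output to keep the same order as my input contig list, because that list is sorted by expression level. Can anyone help me? Thanks for your help.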

from collections import defaultdict
import numpy as np
gene_list = {}
for line in open('idlist.txt'):
    columns = line.strip().split()
    gene = columns[0]
    rien = columns[1]
    gene_list[gene] = rien
gene_count = defaultdict(lambda: np.zeros(6, dtype=int))
out_file= open('out.txt','w')

esem_file = open('Aquilonia.txt')
esem_file.readline()
for line in esem_file:
    fields = line.strip().split()
    exon = fields[0]
    numbers = [float(field) for field in fields[1:]]
    if exon in gene_list.keys():
         gene = gene_list[exon]
         gene_count[gene] += numbers
         print >> out_file, gene, gene_count[gene]

input file:
comp54678_c0_seq3
comp56871_c2_seq8
comp56466_c0_seq5
comp57004_c0_seq1
comp54990_c0_seq11
...
output file comes back in numerical order:
comp100235_c0_seq1 [22 13 15  6 15 16]
comp101274_c0_seq1 [55  2 27 26  6  6]
comp101915_c0_seq1 [20  2 34 12  8  7]
comp101956_c0_seq1 [13 21 11 17 17 28]
comp101964_c0_seq1 [30 73 45 36  0  1]

1 Answer

Use collections.OrderedDict(); it keeps entries in the order they were inserted.

from collections import OrderedDict

with open('idlist.txt') as idlist:
    # strip() first so the trailing newline does not end up in the gene name
    gene_list = OrderedDict(line.strip().split(None, 1) for line in idlist)

The code above reads your gene_list into an ordered dictionary in a single line.
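
As a small illustration of that ordering behaviour, here is a minimal sketch in Python 3 syntax. The in-memory `idlist` and the gene names in its second column (`geneA`, `geneB`, `geneC`) are made up for the example, since the question only shows the first column of idlist.txt.

```python
from collections import OrderedDict
import io

# Hypothetical stand-in for idlist.txt; the second column is invented.
idlist = io.StringIO(
    "comp54678_c0_seq3 geneA\n"
    "comp56871_c2_seq8 geneB\n"
    "comp56466_c0_seq5 geneC\n"
)

# Each line becomes one (id, gene) pair; pairs are stored in file order.
gene_list = OrderedDict(line.strip().split(None, 1) for line in idlist)

# Iterating the dictionary yields the ids in insertion order, not sorted order.
ids_in_order = list(gene_list)
print(ids_in_order)
```

On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict is only strictly required on older interpreters such as the Python 2 used in the question.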

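However, it looks as if you generate your output file purely in the order of the lines of the counts input file (Aquilonia.txt), not in the order of your ID list: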

for line in esem_file:
    # ...
    if exon in gene_list:  # no need to call `.keys()` here
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]

Rework your code to collect the counts first, then write out the data in a separate loop:

with open('Aquilonia.txt') as esem_file:
    next(esem_file, None)  # skip first line
    for line in esem_file:
        fields = line.split()
        exon = fields[0]
        numbers = [float(field) for field in fields[1:]]
        if exon in gene_list:
            gene_count[gene_list[exon]] += numbers

with open('out.txt','w') as out_file:
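    # gene_count is keyed by the gene names (the values of gene_list), so loop
    # over the values; they come back in the order of idlist.txt
    for gene in gene_list.values():
        print >> out_file, gene, gene_count[gene]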
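
For readers on Python 3, the same collect-then-write idea can be sketched as below. This is only an illustration under stated assumptions: the two in-memory files and their contents are invented, plain lists are used in place of the NumPy arrays from the question to keep the example dependency-free, and a `seen` set is added in case several IDs map to the same gene.

```python
from collections import OrderedDict, defaultdict
import io

# Hypothetical sample data standing in for idlist.txt and Aquilonia.txt.
idlist = io.StringIO("c3 geneC\nc1 geneA\nc2 geneB\n")
counts_file = io.StringIO(
    "id s1 s2 s3 s4 s5 s6\n"
    "c1 1 2 3 4 5 6\n"
    "c2 0 0 1 1 2 2\n"
    "c3 9 8 7 6 5 4\n"
)

gene_list = OrderedDict(line.strip().split(None, 1) for line in idlist)

# Plain lists instead of NumPy arrays (a choice made for this sketch).
gene_count = defaultdict(lambda: [0.0] * 6)

# First pass: accumulate the counts in whatever order the counts file has.
next(counts_file, None)  # skip the header line
for line in counts_file:
    fields = line.split()
    exon = fields[0]
    if exon in gene_list:
        totals = gene_count[gene_list[exon]]
        for i, field in enumerate(fields[1:]):
            totals[i] += float(field)

# Second pass: write one line per gene, in the order of the ID list.
out_file = io.StringIO()
seen = set()
for gene in gene_list.values():
    if gene in seen:  # several IDs may map to the same gene
        continue
    seen.add(gene)
    print(gene, *(int(value) for value in gene_count[gene]), file=out_file)

result = out_file.getvalue()
print(result, end="")
```

The counts file lists c1 before c3, but the output starts with geneC because the second loop follows the ID list rather than the counts file.
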
answered 2013-06-17T10:52:45.433