0

以下程序已在两个文件(txt,约 10MB ea.)上运行了约 22 小时。每个文件大约有 100K 行。有人可以告诉我我的代码效率有多低,也许是一种更快的方法。输入 dict 是有序的,并且保持顺序是必要的:

import collections

def uniq(input):
  output = []
  for x in input:
    if x not in output:
      output.append(x)
  return output

Su = {}
with open ('Sucrose_rivacombined.txt') as f:
    for line in f:
        (key, val) = line.split('\t')
        Su[(key)] = val
    Su_OD = collections.OrderedDict(Su)

Su_keys = Su_OD.keys()
Et = {}

with open ('Ethanol_rivacombined.txt') as g:
    for line in g:
        (key, val) = line.split('\t')
        Et[(key)] = val
    Et_OD = collections.OrderedDict(Et)

Et_keys = Et_OD.keys()

merged_keys = Su_keys + Et_keys
merged_keys =  uniq(merged_keys)

d3=collections.OrderedDict()

output_doc = open("compare.txt","w+")

for chr_local in merged_keys:
    line_output = chr_local
    if (Et.has_key(chr_local)):
        line_output = line_output + "\t" + Et[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    if (Su.has_key(chr_local)):
        line_output = line_output + "\t" + Su[chr_local]
    else:
        line_output = line_output + "\t" + "ND"

    output_doc.write(line_output + "\n")

输入文件如下:并非每个键都存在于两个文件中

Su:
chr1:3266359    80.64516129
chr1:3409983    100
chr1:3837894    75.70093458
chr1:3967565    100
chr1:3977957    100


Et:
chr1:3266359    95
chr1:3456683    78
chr1:3837894    54.93395855
chr1:3967565    100
chr1:3976722    23

我希望输出如下所示:

chr1:3266359    80.645    95
chr1:3456683    ND        78
4

3 回答 3

3

用这个替换uniq,因为输入是可散列的:

def uniq(input):
  output = []
  s = set()
  for x in input:
    if x not in s:
      output.append(x)
      s.add(x)
  return output

这会将一个几乎O(n^2)过程减少到几乎O(n)

于 2012-05-02T16:11:24.657 回答
1

你不需要你独特的功能。

伪代码如:

  1. 将文件 2 读取为 OrderedDict
  2. 处理文件 1 写出它的项目(已经正确订购)
  3. pop,在输出行的最后一部分使用文件 2 中的默认值
  4. 在文件一被消耗后,处理来自文件 2 的 Ordered dict

此外,爱情列表推导......您可以阅读该文件:

OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))

只有一个有序的 dict 和 'Sucrose_rivacombined.txt' 甚至从未进入内存。应该超级快

编辑完整代码(不确定您的输出行格式)

from collections import OrderedDict

Et_OD = OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))

with open("compare.txt","w+") as output_doc:
    for line in open('Sucrose_rivacombined.txt'):
        key,val = line.strip().split('\t')
        line_out = '\t'.join((key,val,Et_OD.pop(key,'ND')))
        output_doc.write(line_out+'\n')

    for key,val in Et_OD.items():
        line_out = '\t'.join((key,'ND',val))
        output_doc.write(line_out+'\n')
于 2012-05-02T16:43:24.537 回答
0

output是一个列表,但你的输入是字典:它们的键是唯一的,但你not in output需要与列表中的每个元素进行比较,这是组合的。not(由于该检查,您正在进行 n^2 比较。)

您可能完全可以将 uniq 替换为:

Su_OD.update(Et_OD)

这对我有用:

from collections import OrderedDict

one = OrderedDict()
two = OrderedDict()

one['hello'] = 'one'
one['world'] = 'one'

two['world'] = 'two'
two['cats'] = 'two'

one.update(two)

for k, v in one.iteritems():
    print k, v

输出:

    hello one
    world two
    cats two
于 2012-05-02T16:13:42.813 回答