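I'm facing this problem. My dictionary has 10,000 rows, and this is one of them.

Example, as it appears when printed: A (8) C (4) G (48419) T (2)

I want to get 'G' as the answer, since it has the highest value.

I'm currently using Python 2.4 and I don't know how to solve this, as I'm still quite new to Python.

Any help would be greatly appreciated :)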

5 Answers

3

Here's a solution that

  1. uses a regexp to scan all occurrences of an uppercase letter followed by a number in brackets
  2. transforms the string pairs from the regexp with a generator expression into (value,key) tuples
  3. returns the key from the tuple that has the highest value

I also added a main function so that the script can be used as a command-line tool: it reads all lines from one file and writes the key with the highest value for each line to an output file. The program uses iterators, so it stays memory efficient no matter how large the input file is.

import re
KEYVAL = re.compile(r"([A-Z])\s*\((\d+)\)")

def max_item(row):
    return max((int(v),k) for k,v in KEYVAL.findall(row))[1]

def max_item_lines(fh):
    for row in fh:
        yield "%s\n" % max_item(row)

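def process_file(infilename, outfilename):
    infile = open(infilename)
    outfile = open(outfilename, "w")
    try:
        # writelines consumes the generator lazily, one row at a time
        outfile.writelines(max_item_lines(infile))
    finally:
        # close both files even if a row fails to parse (Python 2.4 has no "with")
        outfile.close()
        infile.close()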

if __name__ == '__main__':
    import sys
    infilename, outfilename = sys.argv[1:]
    process_file(infilename, outfilename)

For a single row, you can call:

>>> max_item("A (8) C (4) G (48419) T (2)")
'G'
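
To make the three steps easier to follow, here is a minimal sketch that spells out the intermediate values for the sample row. The names `pairs`, `swapped`, `best_value` and `best_key` are introduced only for illustration and are not part of the script above.

```python
import re

KEYVAL = re.compile(r"([A-Z])\s*\((\d+)\)")

row = "A (8) C (4) G (48419) T (2)"

# Step 1: findall returns one (letter, digits) pair of strings per match.
pairs = KEYVAL.findall(row)

# Step 2: swap each pair to (value, key) and convert the digits to int,
# so that the comparison is numeric rather than alphabetical.
swapped = [(int(v), k) for k, v in pairs]

# Step 3: tuples compare element by element, so max() picks the tuple
# with the largest value first and only looks at the key on a tie.
best_value, best_key = max(swapped)
```

One consequence of comparing tuples: if two letters share the highest value, the letter that sorts last alphabetically is the one returned.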

And to process a complete file:

>>> process_file("inputfile.txt", "outputfile.txt")

If you want an actual Python list holding the key with the highest value for every row, then you can use:

>>> map(max_item, open("inputfile.txt"))
answered 2011-02-07T12:44:50.503
1
max(d.itervalues())

This should be faster than, say, d.values(), because it iterates over the values instead of building a list of them first.
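
Note that `max(d.itervalues())` yields the largest value itself, not the letter it belongs to. As a minimal sketch, assuming the row has already been parsed into a dictionary `d` (a hypothetical example below), the key can be recovered by taking the maximum over (value, key) tuples. The sketch uses `values()` and `items()` so that it runs on both Python 2 and Python 3; on Python 2, `itervalues()` and `iteritems()` would avoid the intermediate lists. It also avoids the `key` argument of `max`, which Python 2.4 does not support.

```python
# Hypothetical dictionary built from one row of the data.
d = {"A": 8, "C": 4, "G": 48419, "T": 2}

# The largest value on its own.
largest_value = max(d.values())

# The key that owns the largest value: compare (value, key) tuples
# and take the second element of the winning tuple.
largest_key = max((v, k) for k, v in d.items())[1]
```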

answered 2011-02-07T10:06:57.123
1

Try the following:

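st = "A (8) C (4) G (48419) T (2)" # your start string
a = st.split(")")  # ['A (8', ' C (4', ' G (48419', ' T (2', '']
b = [x.replace("(", "").strip() for x in a if x.strip()]  # skip empty or whitespace-only pieces
c = [x.split() for x in b]  # [['A', '8'], ['C', '4'], ['G', '48419'], ['T', '2']]
d = [(int(x[1]), x[0]) for x in c]  # (value, letter) tuples
max(d)[1] # this is your result: 'G'
answered 2011-02-07T10:30:46.653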
0

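Use a regular expression to split the line. Then, for all the matched groups, convert the matched strings to numbers, take the maximum, and work out the corresponding letter.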

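import re
# raw string, so the backslashes reach the regex engine unchanged
r = re.compile(r'A \((\d+)\) C \((\d+)\) G \((\d+)\) T \((\d+)\)')
for line in my_file:  # my_file: an already-open file containing the rows
  m = r.match(line)
  if not m:
    continue # or complain about invalid line
  # pair each count with its group index; the index maps back into "ACGT"
  value, n = max((int(value), n) for (n, value) in enumerate(m.groups()))
  print "ACGT"[n], value
answered 2011-02-07T09:49:20.460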
0
row = "A (8) C (4) G (48419) T (2)"

lst = row.replace("(",'').replace(")",'').split() # ['A', '8', 'C', '4', 'G', '48419', 'T', '2']

dd = dict(zip(lst[0::2],map(int,lst[1::2]))) # {'A': 8, 'C': 4, 'T': 2, 'G': 48419} 

max(map(lambda k:[dd[k],k], dd))[1] # 'G'
answered 2011-03-15T19:42:32.107