
I have a file full of lines like these (this is just a small part of the file):

9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae

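The number refers to a cluster, and it is followed by the genus, species, and family. I want to write a program that looks at each line and reports, for each cluster, the list of distinct genera and how many times each genus occurs in that cluster. So I'm interested in the cluster number and the first "word" on each line.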

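My trouble is that I'm not sure how to get this information. I think I need a for loop, starting with the lines that begin with '0'. The output would be a file that looks something like this: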

Cluster 0: Brucella(2) # lists the cluster, then the genera in the cluster with their counts, something like this.
Cluster 1: Streptomyces(2)
Cluster 2: Brucella(1)
etc.

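Eventually, I want to do the same kind of count for the families in each cluster, and then for genus and species taken together. Any ideas on how to get started would be greatly appreciated!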


2 Answers


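I thought this would be a fun little toy project, so I wrote a small hack that reads an input file like yours from standard input, counts and formats the output recursively, and prints something that looks a bit like your output, but in a nested format, like this: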

Cluster 0:
    Brucella(2)
        melitensis(1)
            Brucellaceae(1)
        neotomae(1)
            Brucellaceae(1)
    Streptomyces(1)
        neotomae(1)
            Brucellaceae(1)
Cluster 1:
    Streptomyces(2)
        geysiriensis(1)
            Streptomycetaceae(1)
        minutiscleroticus(1)
            Streptomycetaceae(1)
Cluster 2:
    Mycobacterium(1)
        phocaicum(1)
            Mycobacteriaceae(1)
Cluster 7:
    Mycobacterium(2)
        gastri(1)
            Mycobacteriaceae(1)
        kansasii(1)
            Mycobacteriaceae(1)
Cluster 9:
    Hyphomicrobium(2)
        facile(2)
            Hyphomicrobiaceae(2)
Cluster 10:
    Streptomyces(2)
        niger(1)
            Streptomycetaceae(1)
        olivaceiscleroticus(1)
            Streptomycetaceae(1)

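I also added some junk data to my table so that I could see an extra entry in cluster 0, separate from the other two... The idea is that you see the top-level "Cluster" entries, followed by nested, indented entries for genus, species, and family... I hope it also wouldn't be hard to extend to a deeper tree.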

# Sys for stdio stuff
import sys
# re for the re.split -- this can go if you find another way to parse your data
import re


# A global (shame on me) for storing the data we're going to parse from stdin
data = []

# read lines from standard input until it's empty (end-of-file)
for line in sys.stdin:
    # Split lines on spaces (gobbling multiple spaces for robustness)
    # and trim whitespace off the beginning and end of input (strip)
    entry = re.split(r"\s+", line.strip())

    # Throw the array into my global data array, it'll look like this:
    # [ "0", "Brucella", "melitensis", "Brucellaceae" ]
    # A lot of this code assumes that the first field is an integer, what
    # you call "cluster" in your problem description
    data.append(entry)

# Sort, first key is expected to be an integer, and we want a numerical
# sort rather than a string sort, so convert to int, then sort by
# each subsequent column. The lambda is a function that returns a tuple
# of keys we care about for each item
data.sort(key=lambda item: (int(item[0]), item[1], item[2], item[3]))


# Our recursive function -- we're basically going to treat "data" as a tree,
# even though it's not.
# parameters:
#    start - an integer telling us what line to begin working from so we needn't
#            walk the whole tree each time to figure out where we are.
#    super - An array that captures where we are in the search. This array
#            will have more elements in it as we deepen our traversal of the "tree"
#            Initially, it will be []
#            In the next ply of the tree, it will be [ '0' ]
#            Then something like [ '0', 'Brucella' ] and so on.
#    data -  The global data structure -- this never mutates after the sort above,
#            I could have just used the global directly
def groupedReport(start, super, data):
    # Figure out what ply we're on in our depth-first traversal of the tree
    depth =  len(super)
    # Count entries in the super class, starting from "start" index in the array:
    count = 0

    # For the few records in the data file that match our "super" exactly, we count
    # occurrences.
    if depth != 0:
        for i in range(start, len(data)):
            if (data[i][0:depth] == data[start][0:depth]):
                count = count + 1
            else:
                break  # We can stop counting as soon as a match fails,
                       # because of the way our input data is sorted
    else:
        count = len(data)


    # At depth == 1, we're reporting about clusters -- this is the only piece of
    # the algorithm that's not truly abstract, and it's only for presentation
    if (depth == 1):
        sys.stdout.write("Cluster " + super[0] + ":\n")
    elif (depth > 0):
        # Every other depth: indent with 4 spaces for every ply of depth, then
        # output the unique field we just counted, and its count
        sys.stdout.write((' ' * ((depth - 1) * 4)) +
                         data[start][depth - 1] + '(' + str(count) + ')\n')

    # Recursion: we're going to figure out a new depth and a new "super"
    # and then call ourselves again. We break out on depth == 4 because
    # of one other assumption (I lied before about the abstract thing) I'm
    # making about our input data here. This could
    # be made more robust/flexible without a lot of work
    subsuper = None
    substart = start
    for i in range(start, start + count):
        record = data[i]  # The original record from our data
        newdepth = depth + 1
        if newdepth > 4:
            break

        # list slice creates a new copy
        newsuper = record[0:newdepth]
        if newsuper != subsuper:
            # Recursion here!
            groupedReport(substart, newsuper, data)
            # Track our new "subsuper" for subsequent comparisons
            # as we loop through matches
            subsuper = newsuper

        # Track position in the data for next recursion, so we can start on
        # the right line
        substart = substart + 1

# First call to groupedReport starts the recursion
groupedReport(0, [], data)

If you put my Python code into a file such as "classifier.py", you can run your input.txt file (or whatever you call it) through it like this:

cat input.txt | python classifier.py

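Most of the magic of the recursion, if you will, comes from list slices: it relies heavily on being able to compare slices, and on the fact that my sort routine puts the input data into a meaningful order. If inconsistent capitalization might cause mismatches, you may want to convert your input data to all lowercase.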
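The same idea of walking sorted rows and grouping on ever-longer prefixes can also be sketched with itertools.groupby from the standard library. This is only an illustration under the same assumption (rows sorted numerically by cluster, then by the remaining fields), not the code above; the name nested_report and the three sample rows are made up for the example.

```python
from itertools import groupby


def nested_report(rows, depth=0, out=None):
    """Return indented report lines for rows that are ALREADY sorted."""
    if out is None:
        out = []
    if not rows or depth >= len(rows[0]):
        return out
    # groupby only merges adjacent equal keys, which is why the sort matters
    for key, group in groupby(rows, key=lambda r: r[depth]):
        group = list(group)
        if depth == 0:
            # Column 0 is the cluster number: print it as a header
            out.append("Cluster %s:" % key)
        else:
            # 4 spaces per level below the cluster, then name(count)
            out.append("%s%s(%d)" % (" " * 4 * depth, key, len(group)))
        nested_report(group, depth + 1, out)
    return out


# Hypothetical sample rows: cluster, genus, species, family
rows = [
    ["0", "Brucella", "neotomae", "Brucellaceae"],
    ["0", "Brucella", "melitensis", "Brucellaceae"],
    ["2", "Mycobacterium", "phocaicum", "Mycobacteriaceae"],
]
# Numeric sort on the cluster, then a plain sort on the remaining fields
rows.sort(key=lambda r: (int(r[0]), r[1:]))
report = nested_report(rows)
print("\n".join(report))
```

Because the depth limit comes from the length of each row rather than a hard-coded 4, adding another column to the input would add another level of nesting.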

answered 2013-06-27T23:06:35.660

This is easy to do.

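  1. Create an empty dictionary {} to store your results; let's call it "result".
  2. Loop over the data line by line.
  3. Split each line on spaces to get the 4 elements your structure defines: cluster, genus, species, family.
  4. Each time a genus turns up under its cluster key during the loop, increment its count; on its first occurrence the count has to be initialized so that it ends up at 1.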

result = { '0': { 'Brucella': 2} ,'1':{'Streptomyces':2}..... }

Code:

my_data = """9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae"""

result = {}
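for line in my_data.splitlines():
    # split() with no argument splits on any run of whitespace and ignores
    # trailing spaces; split(" ") would yield empty strings and break unpacking
    cluster, genus, species, family = line.split()
    # Start the genus count at 0 the first time it is seen, then increment
    result.setdefault(cluster, {}).setdefault(genus, 0)
    result[cluster][genus] += 1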

print(result)


{'10': {'Streptomyces': 2}, '1': {'Streptomyces': 2}, '0': {'Brucella': 2}, '2': {'Mycobacterium': 1}, '7': {'Mycobacterium': 2}, '9': {'Hyphomicrobium': 2}}
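To get from a nested dictionary like this to the "Cluster 0: Brucella(2)" lines the question asks for, and to reuse the same counting for families or for genus and species together, something along these lines might work. It is a sketch only: count_by, format_report, and the four-line sample are hypothetical names and data, and it swaps in collections.Counter and defaultdict in place of setdefault. Note that the cluster keys are strings, so they are sorted with key=int to keep cluster 10 after cluster 2.

```python
from collections import Counter, defaultdict


def count_by(lines, columns):
    """Count, per cluster, how often each combination of `columns` occurs.

    `columns` holds 0-based field positions after splitting a line:
    1 = genus, 2 = species, 3 = family.
    """
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        label = " ".join(fields[i] for i in columns)
        counts[fields[0]][label] += 1
    return counts


def format_report(counts):
    """Turn the nested counts into 'Cluster N: name(count), ...' lines."""
    lines = []
    for cluster in sorted(counts, key=int):  # numeric order, not string order
        parts = ["%s(%d)" % (name, n)
                 for name, n in sorted(counts[cluster].items())]
        lines.append("Cluster %s: %s" % (cluster, ", ".join(parts)))
    return lines


# Hypothetical sample in the same format as the question's file
sample = """0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
10 Streptomyces niger Streptomycetaceae
2 Mycobacterium phocaicum Mycobacteriaceae""".splitlines()

genus_report = format_report(count_by(sample, [1]))
family_report = format_report(count_by(sample, [3]))
binomial_report = format_report(count_by(sample, [1, 2]))
print("\n".join(genus_report))
```

Passing [3] counts families per cluster, and [1, 2] counts each genus and species pair, which covers the follow-up counts mentioned in the question.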
answered 2013-06-27T18:39:05.283