I have a file containing species ids and lineage information.

For example:

162,Bacteria,Spirochaetes,Treponemataceae,Spirochaetia,Treponema
174,Bacteria,Spirochaetes,Leptospiraceae,Spirochaetia,Leptospira
192,Bacteria,Proteobacteria,Azospirillaceae,Alphaproteobacteria,Azospirillum
195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
199,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
201,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
2829358,,,,,
2806529,Eukaryota,Nematoda,,,

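I am writing a script in which I need to get a count for each lineage based on user input (i.e., if the rank is genus, I look at the last field in each line, such as Treponema; if it is class, the fourth field after the id; and so on).

Later I need to convert the counts to a data frame, but first I am trying to turn this lineage file into a dictionary in which, depending on user input, the lineage information (say, the genus) is the key and the ids are the values. This is because multiple ids can match the same lineage information; for example, ids 195, 197, 199, and 201 all return a hit for Campylobacter.

Here is my code: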

def create_dicts(filename):
        '''Transforms the unique_taxids_lineage_allsamples file into a dictionary.
        Note: There can be multiple ids mapping to the same lineage info. Therefore ids should be values.'''
        # Creating a genus_dict
        unique_ids_dict={} # the main dict to return
        phylum_dict=2 # third item in line
        family_dict=3 # fourth item in line
        class_dict=4 # fifth item in line
        genus_dict=5 # sixth item in line
        type_of_dict=input("What type of dict? 2=phylum, 3=family, 4=class, 5=genus\n")

        with open(filename, 'r') as f:
                content = f.readlines()

        for line in content:
                key = line.split(",")[int(type_of_dict)].strip("\n") # lineage info
                value = line.split(",")[0].split("\n") # the id, there can be multiple mapping to the same key
                if key in unique_ids_dict:  # if the lineage info is already a key, skip
                        unique_ids_dict[key].append(value)
                else:
                        unique_ids_dict[key]=value
        return unique_ids_dict

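I had to add .split("\n") to the end of value because I kept getting the error that a 'str' object has no attribute 'append'.

If the user input is 5 (genus), I am trying to get a dictionary like this: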

unique_ids_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', '197', '199', '201'], '': ['2829358', '2806529']}

But instead I get the following:

unique_ids_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', ['197'], ['199'], ['201']], '': ['2829358', ['2806529']]} ##missing str "NONE" haven't figured out how to convert empty strings to say "NONE"

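Also, if anyone knows how to convert all the empty hits to "NONE" or something similar, that would be great. This is a secondary issue, so I can open it as a separate question if needed.

Thanks!

SOLVED: I needed to use extend instead of append.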
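A minimal sketch of why the nesting appeared, using ids from the sample data. Because value is already a one-element list, append stores that list as a single element, while extend adds its contents. The variable names here are illustrative only.

```python
# value in the question is a one-element list such as ["197"]
nested = ["195"]
nested.append(["197"])   # append adds the list itself as ONE element
# nested is now ["195", ["197"]]

flat = ["195"]
flat.extend(["197"])     # extend adds each element of its argument
# flat is now ["195", "197"]

# If the id is kept as a plain string, append gives the same flat result
simple = ["195"]
simple.append("197")
# simple is now ["195", "197"]
```

Either extend with a list or append with a bare string produces the flat list the question expects.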

To rename the empty-string key, I used dict.pop, placed after my if statement:

unique_ids_dict["NONE"] = unique_ids_dict.pop("")
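One caveat with the line above: dict.pop("") raises KeyError when no row has an empty field at the chosen rank. The sketch below guards against that; rename_key is a hypothetical helper name, not part of the original code, and it overwrites any existing entry under the new key.

```python
def rename_key(d, old, new):
    """Move d[old] to d[new] in place; leave d unchanged if old is absent."""
    if old in d:
        d[new] = d.pop(old)
    return d


unique_ids_dict = {
    "Campylobacter": ["195", "197"],
    "": ["2829358", "2806529"],
}
rename_key(unique_ids_dict, "", "NONE")

# A dict without an empty-string key passes through untouched
no_empty = rename_key({"Treponema": ["162"]}, "", "NONE")
```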

Thanks!

2 Answers

def create_dicts(filename):
    '''Transforms the unique_taxids_lineage_allsamples file into a dictionary.
    Note: There can be multiple ids mapping to the same lineage info. Therefore ids should be values.'''
    # Creating a genus_dict
    unique_ids_dict = {}  # the main dict to return
    phylum_dict = 2  # third item in line
    family_dict = 3  # fourth item in line
    class_dict = 4  # fifth item in line
    genus_dict = 5  # sixth item in line
    type_of_dict = input("What type of dict? 2=phylum, 3=family, 4=class, 5=genus\n")

    with open(filename, 'r') as f:
        content = f.readlines()

    for line in content:
        key = line.split(",")[int(type_of_dict)].strip("\n")  # lineage info
        value = line.split(",")[0].split("\n")  # the id, wrapped in a one-element list
        if key in unique_ids_dict:  # lineage already seen: add this id to its list
            unique_ids_dict[key].extend(value)
        else:
            unique_ids_dict[key] = value
    return unique_ids_dict

This works for me. Use extend on the list, not append.
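As an alternative, here is a minimal standard-library sketch of the same grouping that avoids the "is the key already present" branch by using collections.defaultdict. The function name group_ids_by_rank and the missing parameter are assumptions for illustration. The column index follows the question's convention (2=phylum, 3=family, 4=class, 5=genus) and is passed as an argument instead of being read with input().

```python
import csv
from collections import defaultdict


def group_ids_by_rank(lines, column, missing="NONE"):
    """Map the lineage name found in `column` to the list of ids (field 0)."""
    groups = defaultdict(list)
    for fields in csv.reader(lines):
        if not fields:  # skip completely blank lines
            continue
        # Empty lineage fields are grouped under the `missing` label
        key = fields[column].strip() or missing
        groups[key].append(fields[0])
    return dict(groups)


sample = [
    "195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter",
    "197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter",
    "2829358,,,,,",
    "2806529,Eukaryota,Nematoda,,,",
]
result = group_ids_by_rank(sample, 5)
```

csv.reader accepts any iterable of strings, so an open file object can be passed in place of sample, for example inside with open(filename, newline="") as f. The per-lineage counts are then {k: len(v) for k, v in result.items()}.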

Answered 2021-07-15T16:32:23.963

I suggest using pandas; it is simpler, and it is also good to give the columns header names:

import pandas as pd


def create_dicts(filename):
        """
        Transforms the unique_taxids_lineage_allsamples file into a 
        dictionary.
        
        Note: There can be multiple ids mapping to the same lineage info. 
        Therefore ids should be values.
        """
        
        # Reading File:
        content = pd.read_csv(
            filename,
            names=("ID", "Kingdom", "Phylum", "Family", "Class", "Genus")
        )

        # Printing input and choosing clade to work with:
        print("\nWhat type of dict?")
        print("- Phylum")
        print("- Family")
        print("- Class")
        print("- Genus")
        
        clade = input("> ").capitalize()

        # Replacing empty values with string 'None':
        content = content.where(pd.notnull(content), "None")
        
        # Selecting columns and aggregating accordingly to the chosen
        # clade and ID:
        series = content.groupby(clade)["ID"].unique()

        # Creating dict:
        content_dict = series.to_dict()

        # If you do not want to work with Numpy arrays, just create
        # another dict of lists:
        content_dict = {k:list(v) for k, v in content_dict.items()}

        return content_dict


if __name__ == "__main__":
    
    d = create_dicts("temp.csv")

    print(d)

temp.csv:

162,Bacteria,Spirochaetes,Treponemataceae,Spirochaetia,Treponema
174,Bacteria,Spirochaetes,Leptospiraceae,Spirochaetia,Leptospira
192,Bacteria,Proteobacteria,Azospirillaceae,Alphaproteobacteria,Azospirillum
195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
199,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
201,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
2829358,,,,,
2806529,Eukaryota,Nematoda,,,

I hope this is what you wanted to do.
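If only the per-lineage counts mentioned in the question are needed and pandas is not available, a standard-library sketch with collections.Counter could look like the following. RANK_COLUMNS and count_by_rank are hypothetical names; the column indices are taken from the prompt in the question's code (2=phylum, 3=family, 4=class, 5=genus).

```python
import csv
from collections import Counter

# Column index of each rank, as listed in the question's input prompt
RANK_COLUMNS = {"phylum": 2, "family": 3, "class": 4, "genus": 5}


def count_by_rank(lines, rank, missing="NONE"):
    """Count how many ids fall under each lineage name at the given rank."""
    column = RANK_COLUMNS[rank.lower()]
    return Counter(
        (fields[column].strip() or missing)  # empty fields counted as `missing`
        for fields in csv.reader(lines)
        if fields  # skip blank lines
    )


rows = [
    "162,Bacteria,Spirochaetes,Treponemataceae,Spirochaetia,Treponema",
    "195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter",
    "197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter",
    "2806529,Eukaryota,Nematoda,,,",
]
counts = count_by_rank(rows, "genus")
```

A Counter is a dict subclass, so it can be passed on wherever a plain mapping of names to counts is expected.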

Answered 2021-07-15T17:31:19.967