python - How to relate two dependent inputs of unknown size to variables

Question

This is my first python script. My data looks like this:

Position ind1 ind2 ind3 ind4 ind5 ind6 ind7 ind8
0        C    A    C    A    A    A    A    A
1        C    A    C    C    C    A    A    A

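But the number of columns can vary, and there are thousands of rows.

My script does what I need: it reads this file line by line and calculates the frequency of A and C at each position (POS) for combinations of individuals (hereafter, populations). For example, the frequency of A at position 0 for population 1 (ind1, ind2, ind3, ind4), and the frequency of A at position 0 for population 2 (ind5, ind6, ind7, ind8), and then the same for POS 1, 2, 3, and so on.

To do this, I define the combinations of columns (populations) in my script with the following code: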

alleles1 = alleles[1:5]
alleles2 = alleles[5:]

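But if I have more than 9 columns and different combinations of columns, I need to modify alleles* and the rest of the script afterwards.

I want to make my program more interactive, so that the user defines the number of populations and specifies which columns correspond to which population.

The code I have so far: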

import re
import sys

# ask for the number of populations
try:
    num_pop = int(raw_input("How many populations do you have? > "))
except ValueError:
    print "It is not an integer! \nThe program exits...\n "
    sys.exit(1)
# ask for the individuals (columns) in each population
ind_pop = {}
for i in range(1, num_pop + 1):
    ind_input = raw_input("Type column numbers of population %i > " % i)
    ind_pop[i] = re.findall(r'[^,;\s]+', ind_input)

If I have 2 populations, where columns 3, 5, 6 are population 1 and columns 2, 4 are population 2, it works like this:

> How many populations do you have? > 2
> Type column numbers of population 1 > 3, 5, 6  
> Type column numbers of population 2 > 2, 4

The input is stored in a dictionary:

{1: ['3', '5', '6'], 2: ['2', '4']}

The question is how to proceed from this input to defining the alleles. The output should look like this:

allele1 =  [allele[3], allele[5], allele[6]]
allele2 =  [allele[2], allele[4]]

In case it is needed, here is the main part of the rest of the code:

import collections

with open('test_file.txt') as datafile:
  next(datafile)  # skip the header line
  for line in datafile:
    words = line.split()  # split the line into a list of words
    chr_pos = words[0:2]  # select the chromosome and position columns
    alleles = words[2:]  # this and the next lines separate alleles by population

    alleles1 = alleles[0:4]
    alleles2 = alleles[4:8]
    alleles3 = alleles[8:12]
    alleles4 = alleles[12:16]

    counter1 = collections.Counter(alleles1)
    counter2 = collections.Counter(alleles2)
    counter3 = collections.Counter(alleles3)
    counter4 = collections.Counter(alleles4)
#### the rest of the code and some filters within the part above were skipped

score 2 · Accepted Answer

You first need to convert the column numbers to integers:

    ind_pop[i] = [int(j) for j in re.findall(r'\d+', ind_input)]

(I also changed your regex to r'\d+'.)
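As a rough sketch of that conversion step, the parsing could be wrapped in a small helper. This is written for Python 3 (the thread's code is Python 2). The name parse_columns and the optional n_columns bound are hypothetical additions for illustration, not part of the original script.

```python
import re

def parse_columns(text, n_columns=None):
    """Parse a string such as "3, 5, 6" into [3, 5, 6].

    n_columns is a hypothetical optional upper bound used to reject
    indices that do not exist in the data; pass None to skip the check.
    """
    # r'\d+' picks out runs of digits, so separators and stray
    # non-digit characters are skipped instead of reaching int()
    cols = [int(tok) for tok in re.findall(r'\d+', text)]
    if n_columns is not None:
        bad = [c for c in cols if c >= n_columns]
        if bad:
            raise ValueError("column index out of range: %r" % bad)
    return cols
```

Converting once, at the point where the input is read, means the rest of the script can index the alleles list directly.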

Then, instead of having alleles1, alleles2, etc., have one master list or dictionary:

master = {i: [alleles[j] for j in vals] for i, vals in ind_pop.items()}
counters = {i: collections.Counter(al) for i, al in master.items()}

Then you can access counters[i] instead of counter1, etc.
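To make the dictionary approach concrete, here is a minimal Python 3 sketch that applies it to the two sample rows from the question and computes the frequency of A per population. The helper names allele_counters and frequency are made up for this example. It also assumes that the alleles start right after a single Position column and that the column numbers are 0-based indices into that alleles list.

```python
import collections

def allele_counters(alleles, ind_pop):
    """Return {population: Counter of that population's alleles} for one row."""
    return {pop: collections.Counter(alleles[j] for j in cols)
            for pop, cols in ind_pop.items()}

def frequency(counter, allele):
    """Fraction of the counted alleles that equal `allele`."""
    total = sum(counter.values())
    return counter[allele] / total if total else 0.0

# Assumed grouping: ind1..ind4 are population 1, ind5..ind8 are population 2
ind_pop = {1: [0, 1, 2, 3], 2: [4, 5, 6, 7]}
rows = ["0 C A C A A A A A", "1 C A C C C A A A"]

freqs_of_a = []
for line in rows:
    words = line.split()
    alleles = words[1:]  # assumption: only one leading Position column
    counters = allele_counters(alleles, ind_pop)
    freqs_of_a.append({pop: frequency(c, "A") for pop, c in counters.items()})
```

Because the populations live in one dictionary, adding a third or fourth population only changes ind_pop, not the loop.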

As a side note, you could simplify all of the above by making ind_pop a list, using append instead of keeping a counter.
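A sketch of that list-based variant, in Python 3, might look like the following. To keep it runnable without a prompt, the hypothetical helper collect_populations takes the typed answers as a list of strings; in the real script each string would come from the input prompt.

```python
import re

def collect_populations(answers):
    """Build a list of column-index lists, one entry per population."""
    ind_pop = []
    for text in answers:
        # append keeps the populations in the order they were entered,
        # so no separate population counter is needed
        ind_pop.append([int(tok) for tok in re.findall(r'\d+', text)])
    return ind_pop

ind_pop = collect_populations(["3, 5, 6", "2, 4"])

# Alleles of position 0 from the question's sample data
alleles = ["C", "A", "C", "A", "A", "A", "A", "A"]
per_population = [[alleles[j] for j in cols] for cols in ind_pop]
```

If population numbers are still wanted for display, enumerate(ind_pop, 1) yields them alongside each column list.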

score 0 · Accepted Answer

Thanks for your suggestions. Some of them were useful. I feel I need to change direction. I will go on with a list of lists:

pop_alleles = []
for key in ind_pop.keys():
  pop_alleles.append([alleles[el] for el in ind_pop[key]])
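One caveat with looping over ind_pop.keys(): in Python 2, which the code in this thread targets, dictionary iteration order is arbitrary, so the sublists may not come out in population order. A minimal sketch that makes the order explicit by sorting the keys (assuming the indices are already integers, and using row 1 of the sample data) could be:

```python
# Keys deliberately listed out of order to show that sorting fixes the order
ind_pop = {2: [2, 4], 1: [3, 5, 6]}

# Alleles of position 1 from the question's sample data
alleles = ["C", "A", "C", "C", "C", "A", "A", "A"]

# One sublist per population, in ascending population number
pop_alleles = [[alleles[el] for el in ind_pop[key]] for key in sorted(ind_pop)]
```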

score 0 · Accepted Answer

If this is the output you are looking for,

allele1 =  [allele[3], allele[5], allele[6]]
allele2 =  [allele[2], allele[4]]

and you have this:

{1: ['3', '5', '6'], 2: ['2', '4']}

then it is simple from here.

allele1 = []
allele2 = []
for index in population_dict[1]:
    allele1.append(allele[index])
for index in population_dict[2]:
    allele2.append(allele[index])

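Oh, and if the indices are stored as strings, as they are above, you will need to turn them into integers first. You could change the code above to allele[int(index)], but it is better to convert them to integers when you read them in.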
