python - Filling in gaps when merging data in Python

Question

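Here is my problem: I start with several matrices and extract data from them to build a new, combined matrix. The first step is to read the input files with the csv module and extract the "position" values (stored in row[1]), which will become the column headers of the final matrix. Each input file contains a subset of all positions, and some positions occur in more than one file. So I first build a sorted list (ascending integers) from the union of all position values, skipping duplicates. This is how I do it: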

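import csv
import glob

pos = []  # every distinct position seen so far
for infile in glob.glob('passed_*.vcf'):
    infilen = open(infile)
    inf = csv.reader(infilen, delimiter='\t')
    for row in inf:
        # row[1] holds the position; skip values already collected
        if row[1] not in pos:
            pos.append(row[1])
    infilen.close()
pos.sort(key=int)  # sort numerically, not as strings
head = '\t'.join(pos)
of = open('trial.txt', 'a')
print >>of, head  # Python 2 syntax for printing to a file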

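After this, I go back to the original input files and read another value (this time in row[3]), which I want to place under the corresponding header (i.e. the "position") created above. Since each input file contains only a subset of all positions, I have to fill in the gaps wherever a position of the final matrix (stored in the list "pos") is absent from row[1] of a given input file. This is the code I am trying: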

for infile in glob.glob('passed_*.vcf'):
    infilen=open(infile)
    inf = csv.reader(infilen,delimiter='\t')
    seq=[]
    for row in inf:
        if row[1] in pos:
            seq.append(row[3])  
        else:
            seq.append('N')

Needless to say, I am stuck. I was thinking of using a while loop, but since I have no real experience with them, I would appreciate any kind of advice.

Sample data

Input (sample 1):

1   2025    blah    A   .   blah    PASS    AC=0    GT:DP   0/0:61
2   2027    blah    C   .   blah    blah    AC=0    GT:DP   0/0:61
3   2028    blah    T   .   blah    PASS    AC=0    GT:DP   0/0:61

Input (sample n):

1   2025    blah    G   .   blah    PASS    AC=0    GT:DP   0/0:61
2   2026    blah    A   .   blah    blah    AC=0    GT:DP   0/0:61
3   3089    blah    T   .   blah    PASS    AC=0    GT:DP   0/0:61

Output (a single matrix, with row[1] of the inputs as the variables and row[3] as the values. Each row is a different sample, i.e. a different input file):

          2025    2026    2027    2028  ...  3089
sample1    A       NaN     C       T         NaN
samplen    G        A     NaN     NaN         T

score 0 · Accepted Answer

>>> from collections import defaultdict
>>> import glob
>>> pos = defaultdict(dict)
>>> for index, infile in enumerate(glob.glob('D:\\DATA\\FP12210\\My Documents\\Temp\\Python\\sample*.vcf'), 1):
    for line in open(infile):
        # Split once; convert the position to an integer right away
        fields = line.split()
        val, letter = int(fields[1]), fields[3]
        pos[val][index] = letter


>>> def print_pos(pos):
    """ Formats pos """
    # Print header by sorting keys of pos
    values = sorted(pos.keys())
    print '          ',
    for val in range(values[0], values[-1] + 1):
        print '{0:5}'.format(val),
    print

    # pos has keys according to row1, create pos2 with keys = sample #
    pos2 = defaultdict(dict)
    for val, d in pos.iteritems():
        for index, letter in d.iteritems():
            pos2[index][val] = letter

    # Now easier to print lines
    for index in sorted(pos2.keys()):
        print ' sample{0:2} '.format(index),
        for val in range(values[0], values[-1] + 1):
            if val in pos2[index]:
                print '   {0} '.format(pos2[index][val]),
            else:
                print ' NaN ',
        print


>>> print_pos(pos)
            2025  2026  2027  2028  2029  2030  2031  2032
 sample 1     A   NaN     C     T   NaN   NaN   NaN   NaN 
 sample 2     G     A   NaN   NaN   NaN   NaN   NaN     T 
>>>

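I use pos to collect the values, and I also use pos2, which holds the same data keyed differently, for printing, because:

pos is value-oriented, which is useful for getting the range of values
pos2 is sample-oriented, which is useful for looking up the values of a given sample number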
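
The relationship between the two dictionaries can be sketched in isolation. This is a minimal Python 3 illustration (so .items() replaces .iteritems()); the records list is a hypothetical in-memory stand-in for the lines read from the two sample files.

```python
from collections import defaultdict

# Hypothetical stand-in for the parsed files: (sample index, position, letter)
records = [
    (1, 2025, "A"), (1, 2027, "C"), (1, 2028, "T"),
    (2, 2025, "G"), (2, 2026, "A"), (2, 2032, "T"),
]

# Value-oriented: pos[position][sample] = letter
pos = defaultdict(dict)
for index, val, letter in records:
    pos[val][index] = letter

# Sample-oriented: pos2[sample][position] = letter
pos2 = defaultdict(dict)
for val, by_sample in pos.items():
    for index, letter in by_sample.items():
        pos2[index][val] = letter

# pos gives the set of positions; pos2 gives per-sample lookups,
# where a missing position is the gap to fill.
print(sorted(pos))
print(pos2[2].get(2027, "NaN"))
```

Using .get() with a default for the lookup avoids inserting empty entries into the inner dictionaries.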

To keep the range small, I used these values:

- sample1.vcf:

1   2025    blah    A   .   blah    PASS    AC=0    GT:DP   0/0:61
2   2027    blah    C   .   blah    blah    AC=0    GT:DP   0/0:61
3   2028    blah    T   .   blah    PASS    AC=0    GT:DP   0/0:61

- sample2.vcf:

1   2025    blah    G   .   blah    PASS    AC=0    GT:DP   0/0:61
2   2026    blah    A   .   blah    blah    AC=0    GT:DP   0/0:61
3   2032    blah    T   .   blah    PASS    AC=0    GT:DP   0/0:61
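
If the columns should be only the positions that actually occur (as in the question's desired output, which jumps from 2028 to 3089) rather than every integer in the range, a minimal Python 3 sketch could look like the following. The function name build_matrix, its missing parameter, and the in-memory lines are assumptions made for illustration, and the fields are assumed to be tab-separated as in the question's csv.reader call; real input would come from the files matched by glob.

```python
import csv

def build_matrix(samples, missing="NaN"):
    """samples maps a sample name to an iterable of tab-separated lines."""
    per_sample = {}
    for name, lines in samples.items():
        # position (column 1) -> letter (column 3) for this sample
        per_sample[name] = {
            int(row[1]): row[3]
            for row in csv.reader(lines, delimiter="\t")
            if len(row) > 3
        }
    # Sorted union of every position seen in any sample
    header = sorted(set().union(*per_sample.values()))
    # One row per sample, with the placeholder wherever a position is absent
    rows = {
        name: [calls.get(p, missing) for p in header]
        for name, calls in per_sample.items()
    }
    return header, rows

# Hypothetical in-memory copies of the two sample files above
samples = {
    "sample1": [
        "1\t2025\tblah\tA\t.\tblah\tPASS\tAC=0\tGT:DP\t0/0:61",
        "2\t2027\tblah\tC\t.\tblah\tblah\tAC=0\tGT:DP\t0/0:61",
        "3\t2028\tblah\tT\t.\tblah\tPASS\tAC=0\tGT:DP\t0/0:61",
    ],
    "sample2": [
        "1\t2025\tblah\tG\t.\tblah\tPASS\tAC=0\tGT:DP\t0/0:61",
        "2\t2026\tblah\tA\t.\tblah\tblah\tAC=0\tGT:DP\t0/0:61",
        "3\t2032\tblah\tT\t.\tblah\tPASS\tAC=0\tGT:DP\t0/0:61",
    ],
}

header, rows = build_matrix(samples)
print("\t".join(["sample"] + [str(p) for p in header]))
for name in sorted(rows):
    print("\t".join([name] + rows[name]))
```

Building one dictionary per sample and filling gaps with .get() keeps the lookup per cell constant-time, whereas testing membership in a list of positions rescans the list for every row.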
