python-2.7 - 如何使用 python 将制表符分隔格式的 SNP 矩阵转换为数字？

Question

我正在尝试将制表符分隔文件中的基数转换为整数。输入文件必须复制，新副本必须有一个“1”作为参考碱基，“2”表示最常见的替代等位基因，“3”表示下一个，依此类推。我是 python 新手，所以欢迎任何帮助，我一直在尝试使用 pandas，因为文件太大了。以下是数据示例：

dfmolecule      refpos  syn?    refbase 0.1288  1304    09BKT076207
NC_011353       68      NA            A       A    A        A
NC_011353       255     NSYN          A       A    A        A
NC_011353       493     NSYN          T       T    T        C
NC_011353       514     NSYN          C       C    C        C
NC_011353       1790    SYN           G       G    G        G
NC_011353       1798    NSYN          A       A    A        T
NC_011353       2015    SYN           C       C    C        C
NC_011353       2345    SYN           T       T    T        T
NC_011353       2655    NSYN          C       C    C        C
NC_011353       2906    NSYN          C       C    C        C

输出理论上应该是这样的：

dfmolecule      refpos  syn?    refbase 0.1288  1304    09BKT076207
NC_011353       68      NA            1       1    1        1
NC_011353       255     NSYN          1       1    1        1
NC_011353       493     NSYN          1       1    1        2
NC_011353       514     NSYN          1       1    1        1
NC_011353       1790    SYN           1       1    1        1
NC_011353       1798    NSYN          1       1    1        2
NC_011353       2015    SYN           1       1    1        1
NC_011353       2345    SYN           1       1    1        1
NC_011353       2655    NSYN          1       1    1        1
NC_011353       2906    NSYN          1       1    1        1

这将帮助我更好地可视化 SNP，并允许我对每行最常见的等位基因变化进行排名。我不知道从哪里开始这就是我发帖的原因。我所做的代码只是将基数转换为数字。'refbase' 需要始终为 '1' 并且当 python 读取与跨行中的 ref 碱基不同的碱基时，它会用 '2' 替换该行中第二个最常见的等位基因的碱基。我希望那更清楚一点。

我的代码新代码，现在只需要弄清楚如何按频率对等位基因变化进行排名？：

import csv
import pdb
import os
import sys


if len(sys.argv) != 2:

        exit("Need arg <snp file>")

snp_file = sys.argv[1]

wtf = csv.writer(open('/users/new_snp.txt' , 'w'), delimiter='\t')




newf = list(csv.reader(open(snp_file,'rU'), delimiter='\t'))



#--------------------------------------------------------------
# Returns an array of tuples with ('A', 8)
# where the letter is the nucleotide and the number
# is the amount of times a letter is present in a row
#--------------------------------------------------------------

def refbase_count(r):

# This is a blank hash to keep count of occurances
# of the alleles

    rep = {'A':0, 'T':0, 'G':0,'C': 0}


    for i in r:

        rep[i] += 1


    # sort before returning
    import operator

    sorted_rep = sorted(rep.iteritems(), key=operator.itemgetter(1))

    # Want them with the most frequent first
    sorted_rep.reverse()

    return sorted_rep

 #--------------------------------------------------------------

 # print the top row outside of the loop
 print newf[0]
 wtf.writerow(newf[0])

 for row in newf[1:]:

    #rep = refbase_count(row[4:])

 for index, val in enumerate( row[4:] ):

        # If the refbase (in index 3) is equal to the
        # value at a given spot, then we give it a new value of 1
        # otherwise, it's a 2
        if row[3] == val:

            row[index+4] = 1

        else:

             row[index+4] = 2



print row
wtf.writerow(row)

score 0 · Accepted Answer

您应该发布您尝试过的内容以及您认为输出应该是什么样子。也许也花一些时间来澄清你的问题。

如果我理解正确，您想按列排名refbase吗？为此使用

In [38]: rnk = df.groupby('refbase').apply(lambda x: np.size(x)).rank(ascending=False)

In [39]: rnk
Out[39]: 
refbase
A          2
C          1
G          4
T          3
dtype: float64

然后，您可以创建一个新列，其排名基于：

In [40]: df['refpos_rank'] = df.refbase.replace(rnk.to_dict())

In [41]: df
Out[41]: 
  dfmolecule  refpos  syn? refbase 0.1288 1304 09BKT076207 refpos_rank
0  NC_011353      68   NaN       A      A    A           A           2
1  NC_011353     255  NSYN       A      A    A           A           2
2  NC_011353     493  NSYN       T      T    T           C           3
3  NC_011353     514  NSYN       C      C    C           C           1
4  NC_011353    1790   SYN       G      G    G           G           4
5  NC_011353    1798  NSYN       A      A    A           T           2
6  NC_011353    2015   SYN       C      C    C           C           1
7  NC_011353    2345   SYN       T      T    T           T           3
8  NC_011353    2655  NSYN       C      C    C           C           1
9  NC_011353    2906  NSYN       C      C    C           C           1

这可以写入一个制表符分隔的文件df.to_csv(path, sep='\t')

python-2.7 - 如何使用 python 将制表符分隔格式的 SNP 矩阵转换为数字？

1 回答 1

Related

Reference