我正在尝试将制表符分隔文件中的基数转换为整数。输入文件必须复制,新副本必须有一个“1”作为参考碱基,“2”表示最常见的替代等位基因,“3”表示下一个,依此类推。我是 python 新手,所以欢迎任何帮助,我一直在尝试使用 pandas,因为文件太大了。以下是数据示例:
dfmolecule refpos syn? refbase 0.1288 1304 09BKT076207
NC_011353 68 NA A A A A
NC_011353 255 NSYN A A A A
NC_011353 493 NSYN T T T C
NC_011353 514 NSYN C C C C
NC_011353 1790 SYN G G G G
NC_011353 1798 NSYN A A A T
NC_011353 2015 SYN C C C C
NC_011353 2345 SYN T T T T
NC_011353 2655 NSYN C C C C
NC_011353 2906 NSYN C C C C
输出理论上应该是这样的:
dfmolecule refpos syn? refbase 0.1288 1304 09BKT076207
NC_011353 68 NA 1 1 1 1
NC_011353 255 NSYN 1 1 1 1
NC_011353 493 NSYN 1 1 1 2
NC_011353 514 NSYN 1 1 1 1
NC_011353 1790 SYN 1 1 1 1
NC_011353 1798 NSYN 1 1 1 2
NC_011353 2015 SYN 1 1 1 1
NC_011353 2345 SYN 1 1 1 1
NC_011353 2655 NSYN 1 1 1 1
NC_011353 2906 NSYN 1 1 1 1
这将帮助我更好地可视化 SNP,并允许我对每行最常见的等位基因变化进行排名。我不知道从哪里开始这就是我发帖的原因。我所做的代码只是将基数转换为数字。'refbase' 需要始终为 '1' 并且当 python 读取与跨行中的 ref 碱基不同的碱基时,它会用 '2' 替换该行中第二个最常见的等位基因的碱基。我希望那更清楚一点。
我的代码新代码,现在只需要弄清楚如何按频率对等位基因变化进行排名?:
import csv
import pdb
import os
import sys
if len(sys.argv) != 2:
exit("Need arg <snp file>")
snp_file = sys.argv[1]
wtf = csv.writer(open('/users/new_snp.txt' , 'w'), delimiter='\t')
newf = list(csv.reader(open(snp_file,'rU'), delimiter='\t'))
#--------------------------------------------------------------
# Returns an array of tuples with ('A', 8)
# where the letter is the nucleotide and the number
# is the amount of times a letter is present in a row
#--------------------------------------------------------------
def refbase_count(r):
# This is a blank hash to keep count of occurances
# of the alleles
rep = {'A':0, 'T':0, 'G':0,'C': 0}
for i in r:
rep[i] += 1
# sort before returning
import operator
sorted_rep = sorted(rep.iteritems(), key=operator.itemgetter(1))
# Want them with the most frequent first
sorted_rep.reverse()
return sorted_rep
#--------------------------------------------------------------
# print the top row outside of the loop
print newf[0]
wtf.writerow(newf[0])
for row in newf[1:]:
#rep = refbase_count(row[4:])
for index, val in enumerate( row[4:] ):
# If the refbase (in index 3) is equal to the
# value at a given spot, then we give it a new value of 1
# otherwise, it's a 2
if row[3] == val:
row[index+4] = 1
else:
row[index+4] = 2
print row
wtf.writerow(row)