I have a FASTQ file like this (an excerpt):
@A80HNBABXX:4:1:1344:2224#0/1
AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG
+
\\YYWX\PX^YT[TVYaTY]^\^H\`^`a`\UZU__TTbSbb^\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@A80HNBABXX:4:1:1515:2211#0/1
TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA
+
ee^e^\`ad`eeee\dd\ddddYeebdd\ddaYbdcYc`\bac^YX[V^\Ybb]]^bdbaZ]ZZ\^K\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_
@A80HNBABXX:4:1:1538:2220#0/1
CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT
+
fff^fd\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\a\`]aY]ZZ[XYcccYcZ\\]Y
@A80HNBABXX:4:1:1666:2222#0/1
CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT
+
deeee`bbcddddad\bbbbeee\ecYZcc^dd^ddd\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB
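A FASTQ file uses four lines per sequence. Line 1 begins with the '@' character, followed by the sequence identifier. Line 2 holds the DNA sequence letters. Line 3 begins with the '+' character. Line 4 encodes the quality values for the sequence in line 2 (it is the part after the '+' line and before the next '@' line) and must contain the same number of symbols as there are letters in the sequence.
I want to read the FASTQ file into a dictionary like this (the key is the DNA sequence and the value is the quality string; the lines starting with '@' and '+' can be dropped):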
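To make the four-line layout concrete, here is a minimal sketch that walks an iterable of lines in groups of four and checks the rules described above. The function name `iter_fastq_records` is made up for illustration, and the sketch assumes the simple layout shown in the excerpt (each sequence and each quality string on a single line, no wrapping).

```python
from itertools import islice


def iter_fastq_records(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of lines.

    Hypothetical helper: assumes exactly four lines per record, as in the
    excerpt above (no multi-line sequences or quality strings).
    """
    it = iter(lines)
    while True:
        # Take the next four lines and drop their line endings.
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not any(record):
            return  # end of input (or only blank lines left)
        if len(record) < 4:
            raise ValueError("truncated record: %r" % record)
        header, sequence, separator, quality = record
        if not header.startswith("@"):
            raise ValueError("line 1 must start with '@': %r" % header)
        if not separator.startswith("+"):
            raise ValueError("line 3 must start with '+': %r" % separator)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for %r" % header)
        yield header[1:], sequence, quality
```

Because it only needs an iterable of strings, the same function can be fed an open file object or a plain list of lines.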
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG':'\YYWX\PX^YT[TVYaTY]^\^H`^a\UZU__TTbSbb^\a^^^[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT':'fff^fd\c^d^Ycacdcdcdedeffdfedb]beeeeecd^ddccdddddfffeaeeeffdTecacaLV[QRPa\a`]aY]ZZ[XYcccYcZ\]Y ',
....}
I wrote the following code, but it doesn't give me what I want. Can anyone help me fix or improve it?
class fastq(object):
    def __init__(self,filename):
        self.filename = filename
        self.__sequences = {}
    def parse_file(self):
        symbol=['@','+']
        """Stores both the sequence and the quality values for the sequence"""
        f = open(self.filename,'rU')
        for lines in self.filename:
            if symbol not in lines.startwith()
                data = f.readlines()
                return data
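
A few things stand out in the code as posted: `for lines in self.filename` loops over the characters of the filename string instead of the lines of the file, the string method is `startswith` (not `startwith`), and the `if` line has no trailing colon. One possible way to get the dictionary described above is sketched below. It is only a sketch under the assumption that every record occupies exactly four lines; the class name `Fastq` and the attribute `sequences` are illustrative renames, not part of any library.

```python
class Fastq(object):
    def __init__(self, filename):
        self.filename = filename
        self.sequences = {}

    def parse_file(self):
        """Store each DNA sequence (key) with its quality string (value)."""
        # Plain text mode already handles universal newlines in Python 3,
        # so the 'U' flag is not needed.
        with open(self.filename) as handle:
            while True:
                header = handle.readline()       # '@' identifier line
                if not header:
                    break                        # end of file
                if not header.strip():
                    continue                     # skip stray blank lines
                sequence = handle.readline().rstrip("\r\n")
                handle.readline()                # '+' separator line, ignored
                quality = handle.readline().rstrip("\r\n")
                self.sequences[sequence] = quality
        return self.sequences
```

Usage would look like `Fastq("reads.fastq").parse_file()`, where `reads.fastq` is a placeholder path. Note that keying on the sequence means two reads with identical bases overwrite each other, so only the last quality string is kept for a repeated sequence.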