python - Biopython SeqIO: How to write a modified SeqRecord header

Question

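I thought I would try using Biopython to salvage some slightly corrupted fastq files provided by a collaborator. I only need to modify the header lines (the ones starting with @) that contain a certain substring. However, the new fastq files created by the code below are unchanged. No doubt I am missing something obvious.

What is the correct way to write a modified fastq SeqRecord?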

import os, sys
from Bio import SeqIO

path_to_reads = sys.argv[1]
if not os.path.exists(path_to_reads + '/fixed'):
    os.mkdir(path_to_reads + '/fixed')

fwd_fastqs = [fn for fn in os.listdir(path_to_reads) if fn.endswith('_F.fastq')]
rev_fastqs = [fn for fn in os.listdir(path_to_reads) if fn.endswith('_R.fastq')]
fastq_pairs = zip(fwd_fastqs, rev_fastqs)

for fastq_pair in fastq_pairs:
    with open(path_to_reads + '/' + fastq_pair[0], 'r') as fwd_fastq:  # 'rU' mode was removed in Python 3.11
        with open(path_to_reads + '/fixed/' +  fastq_pair[0], 'w') as fixed_fwd_fastq:
            fixed_fwd_records = []
            for fwd_record in SeqIO.parse(fwd_fastq, 'fastq'):
                fwd_record.name = fwd_record.name.replace('/2','/1')
                fixed_fwd_records.append(fwd_record)
            SeqIO.write(fixed_fwd_records, fixed_fwd_fastq, 'fastq')
    # ...

Input data (two records; header lines start with @):

@MISEQ01:115:000000000-A8FBM:1:1112:18038:15085/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGCCAATCATCTCGTATGCCGTCTTCTGCTTG
+
AAAAAAAAAF4CGGGGGAGGGFGHBHGHC5AGEGFFHGA3F355FGG223FFAEE0GCGA55BAB
@MISEQ01:115:000000000-A8FBM:1:1101:20590:9966/2
GATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGA
+
1>A111DFBA1CFA1FFG1BFGB1D1DGFGH3GECA0ABFFG?E///DDGFBB0FEAEEFBDAB2

score 2 · Accepted Answer

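I'm not a Python person, but I work in bioinformatics, so I understand the file formats. I can explain what is happening and why:

Looking at the Biopython Bio.SeqIO.QualityIO fastq writer code, a Biopython SeqRecord object has 2 fields that store parts of the definition line: a name and a description. Normally one would expect it to work like a FASTA file and split the definition line at whitespace, with the name being the left part and the description being the optional comment in the right part. However, the Biopython parser stores a copy of the whole definition line as the description. My guess is that this is a hack (along with the writer code I explain below) to get around CASAVA 1.8 reads, which have spaces in them.

When the writer writes out a record, it checks whether the name and the description match, and if they don't, it writes out the description, presumably on the assumption that the line is a CASAVA 1.8 read, I guess...

Since you only changed the name part, the match test fails and the unchanged description is used. When you blank the description, the writer correctly uses the name field.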
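The decision described above can be modelled in a few lines of plain Python. This is only a sketch: `fastq_title` is a hypothetical helper, not a Biopython function, and it assumes the writer compares the record's id (rather than its name) with the first whitespace-delimited token of the description, which is how the writer in Bio.SeqIO.QualityIO appears to behave.

```python
def fastq_title(record_id, description):
    """Hypothetical model of how the FASTQ header text is chosen.

    Assumption: the writer works from the record's id and description only.
    """
    if description and description.split(None, 1)[0] == record_id:
        # The description already starts with the id: write it as-is.
        return description
    if description:
        # The id and description disagree: write both, separated by a space.
        return f"{record_id} {description}"
    # Empty description: write the id alone.
    return record_id


# Made-up read names for illustration.
unchanged = fastq_title("read1/2", "read1/2")
id_only = fastq_title("read1/1", "read1/2")
blanked = fastq_title("read1/1", "")
print(unchanged, "|", id_only, "|", blanked)
```

Under this model, editing one field while leaving the stale description in place either has no effect or produces a header that contains both the new and the old text, which is why blanking the description matters.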

score 0 · Accepted Answer

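I found a solution, which I don't think is very obvious. The read header line can be accessed through any of SeqRecord.id, SeqRecord.name and SeqRecord.description.

No doubt there are subtle differences between them, but I have looked through the SeqIO documentation and they are not explicitly mentioned. If I add fwd_record.description = '', my script replaces the occurrences of /2 with /1 as expected.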
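To see how the three attributes differ, it can help to print them for a parsed record. The snippet below is a small sketch using two made-up reads (the names `read1/2` and `read2 2:N:0:1` are invented for illustration): one old-style header and one CASAVA 1.8-style header containing a space.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up records: an old-style header and a CASAVA 1.8-style header.
FASTQ = "@read1/2\nACGT\n+\nIIII\n@read2 2:N:0:1\nTTGA\n+\nIIII\n"

# Collect (id, name, description) for every parsed record.
fields = [
    (record.id, record.name, record.description)
    for record in SeqIO.parse(StringIO(FASTQ), "fastq")
]

for record_id, name, description in fields:
    print(f"id={record_id!r} name={name!r} description={description!r}")
```

For the first record all three attributes should hold the same string. For the second, id and name should hold only the text before the space, while description holds the whole header line.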

So, the working code:

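for fastq_pair in fastq_pairs:
    # 'rU' mode was removed in Python 3.11; plain 'r' behaves the same on Python 3
    with open(path_to_reads + '/' + fastq_pair[0], 'r') as fwd_fastq:
        with open(path_to_reads + '/fixed/' + fastq_pair[0], 'w') as fixed_fwd_fastq:
            fixed_fwd_records = []
            for fwd_record in SeqIO.parse(fwd_fastq, 'fastq'):
                # The FASTQ writer builds the header from id and description,
                # so update id and keep name in sync with it
                fwd_record.id = fwd_record.id.replace('/2', '/1')
                fwd_record.name = fwd_record.id
                # Blank the description so the old header text is not written back
                fwd_record.description = ''
                fixed_fwd_records.append(fwd_record)
            SeqIO.write(fixed_fwd_records, fixed_fwd_fastq, 'fastq')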
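For large fastq files, building a list of every record may use a lot of memory. Since SeqIO.write accepts any iterable of records, one alternative is to pass it a generator. The sketch below assumes the tag to fix is always at the end of the read id; `retag_reads` is a hypothetical helper name, and the two reads are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO


def retag_reads(records, old="/2", new="/1"):
    """Yield every record, rewriting ids that end with `old` to end with `new`."""
    for record in records:
        if record.id.endswith(old):
            record.id = record.id[: -len(old)] + new
            record.name = record.id
            # Blank the description so the stale header text is not written.
            record.description = ""
        yield record


# In-memory stand-ins for the input and output files.
source = StringIO("@read1/1\nACGT\n+\nIIII\n@read2/2\nTTGA\n+\nIIII\n")
fixed = StringIO()

# Records are parsed, fixed and written one at a time.
count = SeqIO.write(retag_reads(SeqIO.parse(source, "fastq")), fixed, "fastq")

# Every fourth line of a fastq file is a header line.
headers = fixed.getvalue().splitlines()[0::4]
print(count, headers)
```

Matching only the end of the id is a deliberate difference from str.replace, which would also rewrite a "/2" appearing elsewhere in the read name. Records that do not match are passed through untouched, so any comment after a space in their headers is preserved.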
