python - Renaming the IDs in a GFF file

Question

I have a GFF file that looks like this:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_g4_1G94;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci    exon    452050  452543  .   -   .   ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci    exon    452153  452543  .   -   .   ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
###

I want to rename the IDs so that they start from 0001, so that for the gene above the entries become:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_0001;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_0001.1;Parent=dd_0001
contig1 loci    exon    452050  452543  .   -   .   ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_0001.2;Parent=dd_0001
contig1 loci    exon    452153  452543  .   -   .   ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_0001.2.exon3;Parent=dd_0001.2

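The example above is a single gene entry, but I want to rename all genes, along with their corresponding mRNAs and exons, consecutively starting from ID=dd_0001. Any hints on how to do this would be appreciated.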

score 1 · Accepted Answer

You need to open the file and then replace the ID line by line.
See the documentation on file I/O and str.replace() for reference.

gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'
replace_with = 'dd_0001'

# Read every line, replacing the old ID wherever it occurs
lines = []
with open(gff_filename, 'r') as gff_file:
    for line in gff_file:
        line = line.replace(replace_string, replace_with)
        lines.append(line)

# Write the result back; note that this overwrites the original file
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)

Tested on Windows 10 with Python 3.5.1, and it works.
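As a quick illustration of why a plain str.replace() is enough for one gene: the mRNA and exon IDs all contain the gene ID as a substring, so a single call renames the ID and the Parent together. The sample line is taken from the question; the tab separators are my assumption, since GFF is normally tab-delimited.

```python
# One exon line from the question (tab-separated columns are assumed)
line = "contig1\tloci\texon\t452050\t452543\t.\t-\t.\tID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1\n"

# str.replace() replaces every occurrence, so ID and Parent change together
renamed = line.replace("dd_g4_1G94", "dd_0001")
print(renamed, end="")
```

The same substring behaviour is also the main risk: if one old ID happens to be contained in another, both would be rewritten.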

To find the IDs automatically, you should use a regex.

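import re

gff_filename = 'filename.gff'
replace_with = 'dd_{}'
# Capture the gene-level ID: everything after "ID=" up to the first ';' or '.'
re_pattern = r'ID=(.*?)[;.]'

ids = []
lines = []
with open(gff_filename, 'r') as gff_file:
    file_lines = gff_file.readlines()

# First pass: collect the unique gene IDs in order of appearance
for line in file_lines:
    matches = re.findall(re_pattern, line)
    for found_id in matches:
        if found_id not in ids:
            ids.append(found_id)

# Second pass: replace each old ID with its new zero-padded number
for line in file_lines:
    for ID in ids:
        if ID in line:
            # +1 so that numbering starts at 0001 rather than 0000
            id_suffix = str(ids.index(ID) + 1).zfill(4)
            line = line.replace(ID, replace_with.format(id_suffix))
    lines.append(line)

# Write the result back; note that this overwrites the original file
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)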

There are other ways to do this, but this one is fairly robust.
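One of those other ways, as a rough sketch rather than a drop-in replacement: the loop above calls line.replace() for every known ID, so an ID that is a prefix of another one (say a hypothetical dd_g4_1G9 next to dd_g4_1G94) could be rewritten by mistake. A single pass with re.sub() and a callback avoids that by only touching the gene part of ID= and Parent= values. The function name renumber_gene_ids, its parameters, and the assumption that gene IDs contain no '.', ';' or whitespace are mine, not something from the question.

```python
import re

# Matches the gene-level part of an ID= or Parent= attribute value.
# Assumption: gene IDs contain no '.', ';' or whitespace.
ATTR_PATTERN = re.compile(r'\b(ID|Parent)=([^.;\s]+)')


def renumber_gene_ids(lines, prefix='dd_', width=4):
    """Return new lines with gene IDs replaced by prefix + running number."""
    new_ids = {}  # old gene ID -> new gene ID, in order of first appearance

    def substitute(match):
        key, old_id = match.group(1), match.group(2)
        if old_id not in new_ids:
            # Numbering starts at 1, zero-padded to `width` digits
            new_ids[old_id] = prefix + str(len(new_ids) + 1).zfill(width)
        return key + '=' + new_ids[old_id]

    return [ATTR_PATTERN.sub(substitute, line) for line in lines]


# Two lines from the question (tab-separated columns are assumed)
sample = [
    "contig1\tloci\tgene\t452050\t453069\t15\t-\t.\tID=dd_g4_1G94;\n",
    "contig1\tloci\tmRNA\t452050\t453069\t14\t-\t.\tID=dd_g4_1G94.1;Parent=dd_g4_1G94\n",
]
renamed = renumber_gene_ids(sample)
print(''.join(renamed), end='')
```

To use it on a file, read the lines with readlines(), pass them to the function, and write the result with writelines(), preferably to a new file so the original is kept. Attributes other than ID and Parent (for example Name) are left untouched in this sketch.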
