I am looking for a good library for extracting information from GenBank (gbk) files in R.
This is the typical structure of a gbk file:
gene complement(1..1002)
/gene="bla"
/locus_tag="VV1_RS00005"
/old_locus_tag="VV1_0001"
CDS complement(1..1002)
/gene="bla"
/locus_tag="VV1_RS00005"
/old_locus_tag="VV1_0001"
/inference="COORDINATES: similar to AA
sequence:RefSeq:WP_011078129.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="class A beta-lactamase"
/protein_id="WP_011078129.1"
/translation="MERFMNRSIALCFTLLISSFVPIQPAVANEHNFKDVSQKLETIS
QRLVGRIGVAAQEIGSGERITVNGDEMFVMASTYKVAIAVALLERIDKGELKLSDLID"
gene complement(1131..2111)
/locus_tag="VV1_RS00010"
/old_locus_tag="VV1_0002"
CDS complement(1131..2111)
/locus_tag="VV1_RS00010"
/old_locus_tag="VV1_0002"
/inference="COORDINATES: similar to AA
sequence:RefSeq:WP_017029542.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="GTP-binding protein"
/protein_id="WP_043920887.1"
/translation="MSKKPIPVTILAGFLGAGKTTLLNHILTNANGMRMAVIVNDFGS
INVDAELVKSESDNMISLENGCVCCNLAEGLVVSVMRLLALEQRPDHIVVETSGISEP"
So I would like to extract the information attached to each CDS, in a format like:
>gene|product|locus_tag|old_locus_tag|sequence:RefSeq|protein_id|complement
translation
For the first CDS, this would look like:
>bla|class A beta-lactamase|VV1_RS00005|VV1_0001|WP_011078129.1|1:1002
MERFMNRSIALCFTLLISSFVPIQPAVANEHNFKDVSQKLETISQRLVGRIGVAAQEIGSGERITVNGDEMFVMASTYKVAIAVALLERIDKGELKLSDLID
and the same for the remaining CDSs, of which there may be thousands!
Sorry, I don't know how to do this in R.
Thanks
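
To make the request concrete, here is a minimal base-R sketch of the extraction described above. It is a starting point under stated assumptions, not a tested solution or a replacement for a proper GenBank parser. The function name `cds_to_fasta` is hypothetical. The sketch assumes the input contains only feature-table lines like the excerpt (no file header and no ORIGIN sequence block), and it follows the worked example header (gene, product, locus_tag, old_locus_tag, protein_id, start:end), so the RefSeq accession inside /inference is not extracted. A missing qualifier becomes an empty field, and a multi-part location is collapsed to its lowest and highest coordinate.

```r
# Sketch only: names are illustrative and not taken from any package.
cds_to_fasta <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]

  # A feature line is "<key> <location>", e.g. "CDS complement(1..1002)".
  feature_re <- "^[A-Za-z0-9_'-]+[[:space:]]+((complement|join|order)\\()*[<>]?[0-9]+\\.\\."
  is_feat <- grepl(feature_re, lines)
  is_qual <- startsWith(lines, "/")

  # Each feature or qualifier line opens a new item; other lines continue it.
  starts <- is_feat | is_qual
  item_id <- cumsum(starts)
  item_is_feat <- is_feat[starts]
  items <- split(lines[item_id > 0], item_id[item_id > 0])

  # Wrapped translations are joined without spaces, other text with a space.
  joined <- unname(vapply(items, function(x) {
    sep <- if (startsWith(x[1], "/translation=")) "" else " "
    paste(x, collapse = sep)
  }, character(1)))

  # Number of the feature each item belongs to.
  feat_id <- cumsum(item_is_feat)

  # Return the unquoted value of one qualifier, or "" if it is absent.
  get_qual <- function(quals, name) {
    hit <- quals[startsWith(quals, paste0("/", name, "="))]
    if (length(hit) == 0) return("")
    gsub("^/[^=]+=|\"", "", hit[1])
  }

  out <- character(0)
  for (k in which(item_is_feat)) {
    if (!grepl("^CDS[[:space:]]", joined[k])) next
    quals <- joined[feat_id == feat_id[k] & !item_is_feat]
    # All coordinates in the location, reduced to "lowest:highest".
    pos <- as.integer(regmatches(joined[k], gregexpr("[0-9]+", joined[k]))[[1]])
    fields <- c(
      vapply(c("gene", "product", "locus_tag", "old_locus_tag", "protein_id"),
             function(n) get_qual(quals, n), character(1)),
      paste(min(pos), max(pos), sep = ":")
    )
    out <- c(out,
             paste0(">", paste(fields, collapse = "|")),
             get_qual(quals, "translation"))
  }
  out
}

# The first gene/CDS pair from the excerpt above.
example <- c(
  'gene complement(1..1002)',
  '/gene="bla"',
  '/locus_tag="VV1_RS00005"',
  '/old_locus_tag="VV1_0001"',
  'CDS complement(1..1002)',
  '/gene="bla"',
  '/locus_tag="VV1_RS00005"',
  '/old_locus_tag="VV1_0001"',
  '/inference="COORDINATES: similar to AA',
  'sequence:RefSeq:WP_011078129.1"',
  '/note="Derived by automated computational analysis using',
  'gene prediction method: Protein Homology."',
  '/codon_start=1',
  '/transl_table=11',
  '/product="class A beta-lactamase"',
  '/protein_id="WP_011078129.1"',
  '/translation="MERFMNRSIALCFTLLISSFVPIQPAVANEHNFKDVSQKLETIS',
  'QRLVGRIGVAAQEIGSGERITVNGDEMFVMASTYKVAIAVALLERIDKGELKLSDLID"'
)
fasta <- cds_to_fasta(example)
cat(fasta, sep = "\n")
```

For a whole file, the same function could be fed `readLines()` output and its result passed to `writeLines()`, after removing the header and everything from the ORIGIN line onward. The feature-line regular expression is a heuristic: a wrapped /note line that happens to look like "word 12..34" would be mistaken for a new feature, which a parser working from the original column indentation would avoid.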