perl - 如何解析文件以进行匹配，并在 Perl 中打印匹配字符串之前的字符串？

Question

我正在尝试解析 GBK 文件。基本上，我需要返回与模式匹配的基因的基因座标签和产品名称。因此，如果我想搜索所有预测的基因产物的主题，搜索词“预测”将返回：

/product="predicted semialdehyde dehydrogenase"
/locus_tag="ECDH10B_2481"

我已经能够返回，/product但我无法弄清楚如何“向后”解析以获取/locus_tag.

这是我到目前为止所拥有的：

my $fasta_file = 'example.txt';
open(INPUT, $fasta_file) || die "ERROR: can't read input FASTA file: $!";
while ( <INPUT> ) {
     if(/predicted/){
            print $_;
     }
}

> 示例.txt

gene            complement(2525423..2526436)
                 /gene="usg"
                 /locus_tag="ECDH10B_2481"
 CDS             complement(2525423..2526436)
                 /gene="usg"
                 /locus_tag="ECDH10B_2481"
                 /codon_start=1
                 /transl_table=11
                 /product="predicted semialdehyde dehydrogenase"
                 /protein_id="ACB03477.1"
                 /db_xref="GI:169889770"
                 /db_xref="ASAP:AEC-0002184"
                 /translation="MSEGWNIAVLGATGAVGEALLETLAERQFPVGEIYALARNESAG
                 EQL"
 gene            complement(2526502..2527638)
                 /gene="pdxB"
                 /locus_tag="ECDH10B_2482"
 CDS             complement(2526502..2527638)
                 /gene="pdxB"
                 /locus_tag="ECDH10B_2482"
                 /codon_start=1
                 /transl_table=11
                 /product="erythronate-4-phosphate dehydrogenase"
                 /protein_id="ACB03478.1"
                 /db_xref="GI:169889771"
                 /db_xref="ASAP:AEC-0002185"
                 /translation="MKILVDENMPYARDLFSRLGEVTAVPGRPIPVAQLADADALMVR
                 SVTKVNESLLAGKPIKFVGTATAGTDHVDEAWLKQAGIGFSAAP"

score 1 · Accepted Answer

你不应该“向后解析”。你的/locus标签是一个事件，匹配是另一个。你的逻辑应该运行

您捕获每个基因座标记并存储它们
匹配时，将 locus 标签存储在 keeper 列表中
当您存储下一个时，最后一个基因座标记将自动被破坏。

score 1 · Accepted Answer

只要记住遇到的最后一个基因座标签，如果预测到就打印出来：

#!/usr/bin/perl
use warnings;
use strict;

my $fasta_file = 'example.txt';
open my $INPUT, '<', $fasta_file or die "ERROR: can't read input FASTA file: $!";

my $locus_tag;
while (<$INPUT>) {
    if (/locus_tag/) {
        $locus_tag = $_;
    } elsif (/predicted/) {
        print;
        print $locus_tag;
    }
}

score 0 · Accepted Answer

向后解析非常困难。通过解析每个完整的条目然后确定它是否匹配，您会得到更好的服务。现在这需要做更多的工作，但是当您想对基因数据进行其他操作时，它会非常有用。

我在下面使用的方法构建了%entry. 当它看到下一个“基因”行时，它会处理该条目，在这种情况下检查产品匹配，并为下一个清除它。

我已将DATA文件句柄用于测试目的，它会读取__DATA__行后的所有内容。

#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

my %entry;
while(my $line = <DATA>) {
    # new entry, process the previous one and clear it
    if( $line =~ m{^ gene \s+ complement \( (.*) \) }x ) {
        process_entry(\%entry) if keys %entry;
        %entry = ( complement => $1 );
    }
    elsif( $line =~ m{^CDS \s+ }x ) {
        # ignore CDS lines for now
    }
    elsif( $line =~ m{^\s+/(\w+)=(.*)} ) {
        $entry{$1} = $2;
    }
    else {
        warn "Unknown line $line";
    }
}

# Process the last one.
process_entry(\%entry) if keys %entry;

sub process_entry {
    my $entry = shift;

    say "MATCH! $entry->{locus_tag}" if $entry->{product} =~ /predicted/;

    return;
}


__DATA__
gene            complement(2525423..2526436)
                /gene="usg"
                /locus_tag="ECDH10B_2481"
CDS             complement(2525423..2526436)
                /gene="usg"
                /locus_tag="ECDH10B_2481"
                /codon_start=1
                /transl_table=11
                /product="predicted semialdehyde dehydrogenase"
                /protein_id="ACB03477.1"
                /db_xref="GI:169889770"
                /db_xref="ASAP:AEC-0002184"
                /translation="MSEGWNIAVLGATGAVGEALLETLAERQFPVGEIYALARNESAGEQL"
gene            complement(2526502..2527638)
                /gene="pdxB"
                /locus_tag="ECDH10B_2482"
CDS             complement(2526502..2527638)
                /gene="pdxB"
                /locus_tag="ECDH10B_2482"
                /codon_start=1
                /transl_table=11
                /product="erythronate-4-phosphate dehydrogenase"
                /protein_id="ACB03478.1"
                /db_xref="GI:169889771"
                /db_xref="ASAP:AEC-0002185"
                /translation="MKILVDENMPYARDLFSRLGEVTAVPGRPIPVAQLADADALMVRSVTKVNESLLAGKPIKFVGTATAGTDHVDEAWLKQAGIGFSAAP"

或者，CPAN 上有几个Fasta 阅读器，包括Bio::SeqReader::Fasta和Bio::DB::Fasta。

perl - 如何解析文件以进行匹配，并在 Perl 中打印匹配字符串之前的字符串？

3 回答 3

Related

Reference