regex - Extracting the journal title from a GenBank file with Perl, without using $1, $2, etc.

Question

Here is part of my input GenBank file:

LOCUS       AC_000005              34125 bp    DNA     linear   VRL 03-OCT-2005
DEFINITION  Human adenovirus type 12, complete genome.
ACCESSION   AC_000005 BK000405
VERSION     AC_000005.1  GI:56160436
KEYWORDS    .
SOURCE      Human adenovirus type 12
  ORGANISM  Human adenovirus type 12
            Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE   1  (bases 1 to 34125)
  AUTHORS   Davison,A.J., Benko,M. and Harrach,B.
  TITLE     Genetic content and evolution of adenoviruses
  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
   PUBMED   14573794

I want to extract the journal name, for example J. Gen. Virol. (without the issue and page numbers).

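Here is my code. It doesn't produce any output, so I'd like to know what is wrong. I did get it working by using parentheses and $1, $2, etc., but my tutor told me to try it without that approach and use substr instead.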

foreach my $line (@lines) {
    if ( $line =~ m/JOURNAL/g ) {
        $journal_line = $line;
        $character = substr( $line, $index, 2 );
        if ( $character =~ m/\s\d/ ) {
            print substr( $line, 12, $index - 13 );
            print "\n";
        }
        $index++;
    }
}

score 4 · Accepted Answer

Another approach is to use BioPerl, which can parse GenBank files:

#!/usr/bin/perl

use strict;
use warnings;

use Bio::SeqIO;

my $io=Bio::SeqIO->new(-file=>'AC_000005.1.gb', -format=>'genbank');
my $seq=$io->next_seq;

foreach my $annotation ($seq->annotation->get_Annotations('reference')) {
    print $annotation->location . "\n";
}

If you run this script on AC_000005.1, saved in a file named AC_000005.1.gb, you get:

J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)
J. Virol. 68 (1), 379-389 (1994)
J. Virol. 67 (2), 682-693 (1993)
J. Virol. 63 (8), 3535-3540 (1989)
Nucleic Acids Res. 9 (23), 6571-6589 (1981)
Submitted (03-MAY-2002) MRC Virology Unit, Church Street, Glasgow G11 5JR, U.K.
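
The strings above still carry the volume, issue and page numbers. The sketch below shows one way to cut them off with substr alone. It uses the question's two-character window, but scans it over every position within the string. trim_citation is a hypothetical helper name, not part of BioPerl, and it assumes the journal name ends at the first whitespace character that is followed by a digit. In the script above it would presumably be called as trim_citation($annotation->location), though that combination is untested here.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a citation string at the first whitespace
# character that is followed by a digit, scanning two characters at a time.
sub trim_citation {
    my ($citation) = @_;
    for my $i (0 .. length($citation) - 2) {
        my $pair = substr($citation, $i, 2);
        return substr($citation, 0, $i) if $pair =~ /\s\d/;
    }
    return $citation;    # nothing that looks like a volume number
}

# Sample strings copied from the output shown above
for my $citation ('J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)',
                  'Nucleic Acids Res. 9 (23), 6571-6589 (1981)') {
    print trim_citation($citation), "\n";
}
```

Entries such as the "Submitted (...)" line are not journal citations, so this heuristic will cut them at an arbitrary point; they would need to be filtered out separately.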

score 1 · Accepted Answer

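Rather than matching and then using substr, it is much easier to use a single regular expression that matches the whole JOURNAL line and uses parentheses to capture the text holding the journal information: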

foreach my $line (@lines) {
    if ($line =~ /JOURNAL\s+(.+)/) {
        print "Journal information: $1\n";
    }
}

The regular expression looks for JOURNAL followed by one or more whitespace characters, and (.+) captures the remaining characters on the line.

To get the text without using $1, I think you are trying to do something like this:

if ($line =~ /JOURNAL/) {
    # start just past the keyword; index() finds where JOURNAL begins,
    # because GenBank indents it by two spaces
    my $ix = index($line, 'JOURNAL') + length('JOURNAL');
    # variable containing the journal name
    my $j_name;
    # while the journal name is not defined and we are still inside the line...
    while (! defined $j_name && $ix < length $line) {
        # get character $ix in the string
        if (substr($line, $ix, 1) =~ /\s/) {
            # if it is whitespace, increase $ix by one
            $ix++;
        }
        else {
            # if it isn't whitespace, we've found the text
            $j_name = substr($line, $ix);
        }
    }
}

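If you already know how many characters are in the left-hand column, you can use substr($line, 12) (or whatever the offset is) to retrieve the substring of $line starting at character 12: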

foreach my $line (@lines) {
    if ($line =~ /JOURNAL/) {
        print "Journal information: " . substr($line, 12) . "\n";
    }
}

You can combine the two techniques to strip the issue number and date from the journal data:

if ($line =~ /JOURNAL/) {
    my $j_name;
    my $digit;
    my $indent = 12; # the width of the left-hand column
    my $ix = $indent; # we'll use this to track the characters in our loop
    # stop at the first digit, or at the end of the line if there is none
    while (! $digit && $ix < length $line) {
        # starting with $ix = the length of the indent,
        # get character $ix in the string
        if (substr($line, $ix, 1) =~ /\d/) {
            # if it is a digit, we've found the number of the journal
            # we can stop looping now. Whew!
            $digit = $ix;
            # set j_name
            # get a substring of $line starting at $indent going to $digit
            # (i.e. of length $digit - $indent)
            $j_name = substr($line, $indent, $digit-$indent);
            # drop the space that separates the name from the number
            $j_name =~ s/\s+$//;
        }
        $ix++;
    }
    # $j_name stays undefined when the line contains no digit
    print "Journal information: $j_name\n" if defined $j_name;
}
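
For completeness, here is a minimal self-contained sketch of the same idea that avoids both the character-by-character loop and the capture variables. It relies on Perl's built-in @- array, where $-[0] holds the offset at which the last successful match started. journal_name is a hypothetical helper name, and the 12-character keyword column is an assumption taken from the sample record in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original answers): return the journal
# name from a GenBank JOURNAL line, or undef if the line is not one.
sub journal_name {
    my ($line) = @_;
    my $indent = 12;                       # assumed width of the keyword column
    return unless index($line, 'JOURNAL') >= 0;
    return if length($line) <= $indent;
    my $rest = substr($line, $indent);     # text after the keyword column
    $rest =~ s/\s+\z//;                    # drop the trailing newline or spaces
    # $-[0] is the offset where the last successful match started,
    # so no capture groups ($1, $2, ...) are needed
    if ($rest =~ /\s\d/) {
        return substr($rest, 0, $-[0]);
    }
    return $rest;                          # no volume number found
}

my @lines = (
    "  TITLE     Genetic content and evolution of adenoviruses\n",
    "  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)\n",
    "   PUBMED   14573794\n",
);

foreach my $line (@lines) {
    my $name = journal_name($line);
    print "$name\n" if defined $name;
}
```

Matching whitespace followed by a digit, rather than any digit, keeps a journal name that itself contains a digit intact up to the volume number, but it remains a heuristic rather than a real citation parser.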

I think it would be easier to get the data from the PubMed API! ;)
