
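I extracted a sequence from a GenBank file. It consists of single-line strings of 60 bases each (with a trailing \n). How can I use Perl to reformat the sequence so that 120 bases are printed per line, using a regex rather than BioPerl? Original format: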

    1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
   61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
  121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
  181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
  241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
  301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
  361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
  421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
  481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat 

I have only managed to turn them into strings 60 characters long. I am still trying to figure out how to make them 120 characters long.

my @lines= <$FH_IN>;
foreach my $line (@lines) {
    if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
            $line=~ s/$1//;
            $line=~ s/ //g;
            print $line;
    }

}

Sample input:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat

Each single-line string has 60 bases.

Update (this still does not give sequence lines 120 bases long):

my @seq_60;
foreach my $line (@lines) {
        if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
                $line=~ s/$1//;
                $line=~ s/ //g;
                push (@seq_60, $line);
        }
}

my @output;
for (my $pos= 0; $pos< @seq_60; $pos+= 2) {
        push (@output, $seq_60[$pos] . $seq_60[$pos+1]);
}

print @output;

2 Answers


How about:

s/(^|\n)([^\n]{60})\n/$1$2/g

In action:

use strict;
use warnings;
use 5.014;

my $str = q/agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat/;

$str =~ s/(^|\n)([^\n]{60})\n/$1$2/g;
say $str;

Output:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggccatccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat

Explanation:

(^|\n)      : group 1, start of string or line break
(           : start group 2
  [^\n]{60} : anything that is not a line break 60 times
)           : end group 2
\n          : line break
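
The same substitution can also be written with the /x modifier so that the pattern carries its own comments. This is only a minimal sketch: the toy input uses 4-character lines in place of 60 so the result is easy to check by eye, and the names $width and $str are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical toy input: four lines of 4 bases each instead of 60.
my $width = 4;
my $str   = "aaaa\ncccc\ngggg\ntttt\n";

$str =~ s/
    (^|\n)            # group 1: start of string or a line break
    ([^\n]{$width})   # group 2: one full line of non-newline characters
    \n                # the line break that gets dropped
/$1$2/gx;

print $str;
```

Because each match consumes one line and the newline after it, only every second line break is removed, so the lines end up joined in pairs.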

Edit, following the comments:

Join the lines in pairs:

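my @out;
# @in holds the 60-base lines, each still ending in "\n"
for (my $i = 0; $i < @in; $i += 2) {
    chomp($in[$i]);
    # an odd number of lines leaves the last one without a partner
    push @out, $in[$i] . (defined $in[$i+1] ? $in[$i+1] : "\n");
}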
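
To see the pairing applied end to end, here is a rough sketch that starts from the numbered GenBank lines rather than from an already cleaned array. It assumes the layout shown in the question (a position column followed by space-separated blocks of bases); the three input lines are copied from the question, and reading them through an in-memory filehandle is just a convenience for the example.

```perl
use strict;
use warnings;

# Three lines copied from the question's excerpt (an odd count on purpose).
my $genbank = <<'END';
    1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
   61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
  121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
END

open my $fh, '<', \$genbank or die "cannot open in-memory handle: $!";

my @in;
while (my $line = <$fh>) {
    next unless $line =~ s/^\s*\d+\s+//;   # strip the position column; skip other lines
    $line =~ s/\s+//g;                     # remove the spaces and the trailing newline
    push @in, $line;
}

my @out;
for (my $i = 0; $i < @in; $i += 2) {
    # join each line with its successor; an unpaired last line is kept as is
    push @out, $in[$i] . ($i + 1 < @in ? $in[$i + 1] : '');
}

print "$_\n" for @out;
```

Stripping the newline while cleaning each line means no chomp is needed later, and the line break is added back only once per joined pair when printing.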
answered 2014-10-18T09:56:20.177

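You can read and write lines in the same pass, storing the previous line in a variable. See the code comments for an explanation of what is happening: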

my $prev;
while (<$FH_IN>) {
    next unless /\w/; # make sure the lines have some content
    # remove the line endings
    chomp;
    # chop off the first 6 characters (the base numbers) - format is 4 chars that
    # can be numbers or spaces, a digit, and a space
    $_ =~ s/^[\s\d]{4}\d\s//g;
    # remove the spaces between bases
    $_ =~ s/\s//g;
    # have we got a saved line?
    if ($prev) {
        # print out saved line and this line
        print $prev . $_ . "\n";
        # delete the saved line $prev
        $prev = '';
    }
    else {
        # if we don't have a saved line, save this line
        $prev = $_;
    }
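}
# an odd number of input lines leaves one saved line; print it too
print "$prev\n" if defined $prev && length $prev;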
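
If the target width may change later, a different technique (not the one used in the code above) is to concatenate all bases first and then cut the result into chunks with a regex match in list context. The sketch below assumes the same numbered layout as in the question; the helper name rewrap and the short sample input are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: strips position numbers and whitespace, then re-wraps
# the whole sequence into chunks of at most $width bases.
sub rewrap {
    my ($text, $width) = @_;
    my $seq = '';
    for my $line (split /\n/, $text) {
        next unless $line =~ /^\s*\d+\s+([acgt\s]+)$/i;   # sequence lines only
        (my $bases = $1) =~ s/\s+//g;                      # drop the spaces between blocks
        $seq .= $bases;
    }
    return $seq =~ /.{1,$width}/g;    # list of chunks; the last one may be shorter
}

# Made-up sample: 25 bases spread over two numbered lines, re-wrapped to width 8.
my $genbank = "    1 acgtacgtac gtacgtacgt\n   21 ttttt\n";
print "$_\n" for rewrap($genbank, 8);
```

Calling rewrap($text, 120) on the real input would give the 120-base lines asked for, at the cost of holding the whole sequence in memory.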
answered 2014-10-18T11:13:01.920