
I'm very sorry to bother you with several questions, but I need to get this solved...

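I want to extract several substrings from a file that contains one long string, using a second file that holds the start and end of each substring I want to extract. The first file looks like this: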

>scaffold30     24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................

(the string must be everything in the file except the first line), and the coordinates file looks like this:

44801988    44802104
44846151    44846312
45620133    45620274
45640443    45640543
45688249    45688358
45729531    45729658
45843362    45843490
46066894    46066996
46176337    46176464
.....................

My script looks like this:

my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];

#finds  subsequences: fasta files



open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;

while(<INFILE1>) {
    if ($_ !~ m/^>/) {

    local $/ = undef;
    my $var = <INFILE1>;

    open INFILE, $coords_file or die "Could not open $coords_file: $!";
           my @cline = <INFILE>;
    foreach my $cline (@cline) {
    print "$cline\n";
            my@data = split('\t', $cline);
            my $start = $data[0];
            my $end = $data[1];
            my $offset = $end - $start;
           $count++;
           my $sub = substr ($var, $start, $offset);
           print ">conserved $count\n";
           print "$sub\n";

    }
    close INFILE;
    }
}

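When I run it, it seems to do only one iteration, and it prints the beginning of the first file. It looks like the foreach loop is not working, and substr does not seem to work either. When I print $cline to check the loop, it prints all the lines of the coordinates file.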

I'm sorry if I'm becoming annoying, but I have to finish this and I'm a bit desperate...

Thank you again.


2 Answers


The line

local $/ = undef;

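changes $/ for the entire enclosing block, which includes the part where you read in the second file. $/ is the input record separator, which essentially defines what a "line" is (a newline by default; see perldoc perlvar for details). When you read from a filehandle with <>, $/ determines where the read stops. For example, the following program relies on the default newline behavior, so it reads only up to the first newline: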

use feature 'say';

my $foo = <DATA>;
say $foo;
# Output:
# 1

__DATA__
1
2
3

whereas this program reads all the way to EOF:

use feature 'say';

local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3

__DATA__
1
2
3
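
One way to keep slurp mode from leaking into later reads is to confine `local $/` to its own small scope. The following is a minimal sketch rather than the asker's script: it reads from in-memory filehandles so that it runs on its own, and the names `slurp_rest`, `$seq_fh`, and `$coord_fh`, as well as the sample data, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything left on a filehandle as one string.
# `local $/` is confined to this sub, so $/ is restored when the sub returns.
sub slurp_rest {
    my ($fh) = @_;
    local $/;                 # undef => slurp mode, only inside this sub
    my $content = <$fh>;
    return $content;
}

# In-memory "files" so the sketch is self-contained (assumed sample data).
my $seq_text   = ">header\nACGT\nTTGA\n";
my $coord_text = "1\t3\n2\t4\n";

open my $seq_fh,   '<', \$seq_text   or die "Cannot open sequence: $!";
open my $coord_fh, '<', \$coord_text or die "Cannot open coordinates: $!";

my $header   = <$seq_fh>;            # line mode: reads only the header line
my $sequence = slurp_rest($seq_fh);  # slurp mode: reads the rest in one go
my @coords   = <$coord_fh>;          # line mode again: one element per line

print scalar(@coords), " coordinate lines\n";
```

Because `$/` is back to a newline once `slurp_rest` returns, the list-context read of the coordinates yields one element per line instead of a single large string.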

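This means that your @cline array has only one element, which is a string containing the entire text of the coordinates file. You can see this using Data::Dumper: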

use Data::Dumper;

print Dumper(\@cline);

In your case, it will output something like this:

$VAR1 = [
          '44801988    44802104
44846151    44846312
45620133    45620274
45640443    45640543
45688249    45688358
45729531    45729658
45843362    45843490
46066894    46066996
46176337    46176464
'
        ];

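Note that your array (in this case, an arrayref), delimited by [ and ], contains only a single element, which is a string containing newlines (delimited by single quotes).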

Let's walk through the relevant parts of the code:

while(<INFILE1>) {
    if ($_ !~ m/^>/) {
        # Enable localized slurp mode. Stays in effect until we leave the 'if'
        local $/ = undef;

        # Read the rest of INFILE1 into $var (from current line to EOF)
        my $var = <INFILE1>;

        open INFILE, $coords_file or die "Could not open $coords_file: $!";

        # In list context, return each block until the $/ character as a
        # separate list element. Since $/ is still undef, this will read
        # everything until EOF into our first list element, resulting in
        # a one-element array
        my @cline = <INFILE>;

        # Since @cline only has one element, the loop only has one iteration
        foreach my $cline (@cline) {

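As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should be using lexical filehandles anyway (along with the three-argument form of open):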

open my $chromosome_fh,  "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;

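Also, you don't need nested loops in this case; they only make your code more complicated. First, read the relevant part of the chromosome file into a variable (named something more meaningful than var):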

# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
    next if /^>/;
    chomp;  # drop the newline so substr offsets count only sequence characters
    $chromosome .= $_;
}

Then read in your coordinates file:

my @cline = <$coordinates_fh>;

Or, if you only need to use the contents of the coordinates file once, process each line with a while loop:

while (<$coordinates_fh>) {
    # Do something for each line here
}
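
Inside that loop, each line still has to be split into a start and an end and checked against the sequence length, because `substr` returns `undef` (and warns under `use warnings`) when the start lies beyond the end of the string. A rough sketch follows, with a hypothetical helper `extract_range` and made-up sample data. It keeps the question's `substr($var, $start, $end - $start)` convention, which is only an assumption about what the coordinates mean.

```perl
use strict;
use warnings;

# Hypothetical helper: return the substring for one "start end" line,
# or undef if the line is malformed or the range falls outside the sequence.
sub extract_range {
    my ($sequence, $line) = @_;
    my ($start, $end) = split ' ', $line;    # split on any whitespace
    return unless defined $end;
    return unless $start =~ /^\d+$/ && $end =~ /^\d+$/;
    return if $start > $end || $end > length($sequence);
    return substr($sequence, $start, $end - $start);
}

my $sequence = "ACGTTTGACC";                 # assumed sample sequence
my @lines    = ("2\t5\n", "8 40\n", "oops\n");

my $count = 0;
for my $line (@lines) {
    my $sub = extract_range($sequence, $line);
    if (defined $sub) {
        $count++;
        print ">conserved $count\n$sub\n";
    }
    else {
        chomp(my $shown = $line);
        warn "Skipping line '$shown': malformed or out of range\n";
    }
}
```

A check like this would report coordinates that are larger than the sequence read from the first file, instead of silently printing nothing.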
answered 2013-08-29T15:46:55.130

As "ThisSuitIsBlackNot" suggested, your code could be cleaned up a bit. Here is one possible solution that may be what you want.

#!/usr/bin/perl
use strict;
use warnings;

my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];

#finds  subsequences: fasta files

open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;

<INFILE1>; # get rid of the first line - '>scaffold30     24194'

while(<INFILE1>) {
    chomp;
    $fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";

open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;

while(<INFILE>) {
    my ($start, $end) = split;

    # Or, should this be: my $offset = $end - ($start - 1);
    # That would include the start fasta
    my $offset = $end - $start;

    $count++;
    my $sub = substr ($fasta, $start, $offset);
    print ">conserved $count\n";
    print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";
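
The question raised in the comment above depends on how the coordinates file counts positions, which the source does not state. If the coordinates are 1-based and inclusive at both ends (an assumption), the call would shift the start back by one and lengthen the span by one. A small sketch with a hypothetical helper `slice_1based` and a made-up sequence:

```perl
use strict;
use warnings;

# Hypothetical helper: assumes 1-based coordinates, inclusive at both ends.
sub slice_1based {
    my ($sequence, $start, $end) = @_;
    # substr offsets are 0-based, so position 1 is offset 0;
    # an inclusive range start..end covers end - start + 1 characters.
    return substr($sequence, $start - 1, $end - $start + 1);
}

my $fasta = "ACGTTTGACC";                 # made-up sample sequence
print slice_1based($fasta, 2, 5), "\n";   # positions 2..5 => CGTT
```

For the same pair, `substr($fasta, 2, 3)` as written in the code above would give `GTT`, so the two conventions differ in both where the slice starts and how long it is.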
answered 2013-08-29T19:12:51.817