My input file contains the following information, in this format:

>V063O:34:49 length=314
GAGATGACTCCCAGGGGGGGGGGATGAAACCCAGACCTGGCACCATGGGATCAGCCATTC
CATCTTGACCAAAGGGGGGGGGGAAAGAAAGTGTAATTAATAAAGTACAGTGGCAGAGAG
AGTTCAAATAGTTGCGAGTCTACTCTGGAGGTTGCTGTTGTGCTAAGCTTCAGGTTATAC
CTTGACCCTACCATACCCCCCAAACCAGGACAATTCCAAGCCCAAATCCGTAAAAGAAAC
ACCTAAGGCAATATATAAGATTCTACAGGTCATACATCTAGACTACTTACTAACAATCCG
TAACAACCTCAGAT
>V063O:35:44 length=104
GCTCTTTTTTTTTTTAGCAAAAACCGTTAGCCAATCCCTACCCAACCCCTGGCACCTGGG
GGGGGGTGCCCGAGCGCCGGTGGGAGAACGGAGGAAACGCACTC

The sequences (the strings of data below each line containing the ID and length=) are run through the following regular expressions:

# Search the sequence for a 2-letter combination of ACGT repeated at least 10 times
my $regex1 = qr/( ([ACGT]{2}) \2{9,} )/x;
# Search the sequence for a 3-letter combination of ACGT repeated at least 7 times
my $regex2 = qr/( ([ACGT]{3}) \2{6,} )/x;
# Search the sequence for a 4-letter combination of ACGT repeated at least 7 times
my $regex3 = qr/( ([ACGT]{4}) \2{6,} )/x;

for my $regex ($regex1, $regex2, $regex3) {
    next unless $seq1 =~ $regex;
    # $1 is the whole repeated run, $2 is the repeating unit
    printf "Matched %s exactly %d times\n", $2, length($1)/length($2);
    print "Length of sequence: $number \n";
}
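
To make the counting step concrete, here is a minimal sketch of how one of these patterns produces the repeat count. The helper name find_repeat and the test sequence are made up for illustration; only the pattern itself comes from the code above.

```perl
use strict;
use warnings;

# Same pattern as $regex2 above: a 3-letter unit followed by at least
# 6 more copies of itself (7 or more copies in total).
my $regex = qr/( ([ACGT]{3}) \2{6,} )/x;

# Hypothetical helper: returns (unit, count) for the first repeat run
# found in $seq, or an empty list if there is none.
sub find_repeat {
    my ($seq, $re) = @_;
    return unless $seq =~ $re;
    # $1 is the whole run, $2 is one copy of the repeating unit.
    my ($run, $unit) = ($1, $2);
    return ($unit, length($run) / length($unit));
}

# Made-up test sequence: "AA", then GCT repeated 13 times, then "TT".
my $seq = 'AA' . ('GCT' x 13) . 'TT';

if (my ($unit, $count) = find_repeat($seq, $regex)) {
    printf "Matched %s exactly %d times\n", $unit, $count;   # Matched GCT exactly 13 times
    printf "Length of sequence: %d\n", length $seq;          # Length of sequence: 43
}
```

Because the quantifier is greedy, the run captured in $1 extends as far as the unit keeps repeating, so dividing its length by the unit length gives the full repeat count rather than the minimum.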

Currently, this prints the results to the command line for a sample text file that contains only a single sequence.

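I need to print the following elements, taken from the file and from the regex matches above, to a text file, with a separate output file for each sequence (so every sequence found in the input file gets its own file):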

The ID (example: V063O:34:49), the element that is repeating (example: GCT), the number of repeats (example: 13), and the length of the entire sequence.

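The constraints are that BioPerl is not an option (the end users are not proficient in Perl, so this should be as easy as possible for them, with no modules to download) and that the input files are very large (300 MB or more).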

What is the best way to approach this?

1 Answer

From your comments, it looks like this may be homework. Did you intend to solve this yourself?

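use strict;
use warnings;
use autodie;    # open() dies with a message on failure

# A 2-letter unit repeated at least 10 times, or a 3- or 4-letter
# unit repeated at least 7 times
my @regexes = (
  qr/( ([ACGT]{2}) \2{9,} )/x,
  qr/( ([ACGT]{3}) \2{6,} )/x,
  qr/( ([ACGT]{4}) \2{6,} )/x,
);

open my $fh, '<', 'data.txt';

my $seq;
my $id;

# Read line by line so that only one sequence is held in memory at a time
while (<$fh>) {

  if (/^>(\S+)/) {
    # A new header line: process the sequence collected so far
    process_sequence($id, $seq) if $seq;
    $id = $1;
    $seq = '';
  }
  else {
    # A sequence line: append it to the current sequence
    chomp;
    $seq .= $_;
  }
}
# Process the last sequence in the file
process_sequence($id, $seq) if $seq;

sub process_sequence {
  my ($id, $seq) = @_;
  for my $regex (@regexes) {
      next unless $seq =~ $regex;
      # $1 is the whole repeated run, $2 is the repeating unit
      printf "Sequence ID %s matched %s exactly %d times\n", $id, $2, length($1)/length($2);
      printf "Length of sequence: %s \n", length $seq;
      print "\n";
  }
}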
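
The code above prints to the command line. To get one output file per sequence, as the question asks, something along the following lines might work. This is only a sketch under several assumptions: the sub names, the output file naming scheme (the ID with unsafe characters replaced by underscores, plus .txt), the report layout, and the choice to write a file even when a sequence has no repeats are all made up here. It also uses /g so that every repeat run is reported, not just the first one per pattern. It needs no modules beyond core Perl and keeps only one sequence in memory at a time.

```perl
use strict;
use warnings;

my @regexes = (
    qr/( ([ACGT]{2}) \2{9,} )/x,
    qr/( ([ACGT]{3}) \2{6,} )/x,
    qr/( ([ACGT]{4}) \2{6,} )/x,
);

# Return one [unit, count] pair for every repeat run found in $seq.
sub find_repeats {
    my ($seq) = @_;
    my @hits;
    for my $regex (@regexes) {
        while ($seq =~ /$regex/g) {
            push @hits, [ $2, length($1) / length($2) ];
        }
    }
    return @hits;
}

# Turn an ID such as "V063O:34:49" into a file name. Colons and other
# unusual characters are replaced because some systems reject them.
sub file_name_for {
    my ($id) = @_;
    (my $safe = $id) =~ s/[^A-Za-z0-9._-]/_/g;
    return "$safe.txt";
}

# Write the report for one sequence into its own file inside $dir.
sub write_report {
    my ($dir, $id, $seq) = @_;
    my $path = "$dir/" . file_name_for($id);
    open my $out, '>', $path or die "Cannot write $path: $!";
    printf {$out} "ID: %s\n", $id;
    printf {$out} "Length of sequence: %d\n", length $seq;
    printf {$out} "Repeat %s occurs %d times\n", @$_ for find_repeats($seq);
    close $out or die "Cannot close $path: $!";
    return $path;
}

# Read the input line by line and write a report for each sequence.
sub process_file {
    my ($in_path, $out_dir) = @_;
    open my $fh, '<', $in_path or die "Cannot read $in_path: $!";
    my ($id, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;    # strip the newline (and any carriage return)
        if (my ($new_id) = $line =~ /^>(\S+)/) {
            write_report($out_dir, $id, $seq) if defined $id;
            ($id, $seq) = ($new_id, '');
        }
        elsif (defined $id) {
            $seq .= $line;
        }
    }
    write_report($out_dir, $id, $seq) if defined $id;
    close $fh;
    return;
}

# Usage: perl script.pl data.txt output_directory
process_file(@ARGV) if @ARGV == 2;
```

The output directory has to exist before the script runs. With a 300 MB input this can create a very large number of small files, so it may be worth checking whether a single combined report would serve the end users better.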
answered 2013-02-27T03:23:51.707