My input file contains the following information, in this format:
>V063O:34:49 length=314
GAGATGACTCCCAGGGGGGGGGGATGAAACCCAGACCTGGCACCATGGGATCAGCCATTC
CATCTTGACCAAAGGGGGGGGGGAAAGAAAGTGTAATTAATAAAGTACAGTGGCAGAGAG
AGTTCAAATAGTTGCGAGTCTACTCTGGAGGTTGCTGTTGTGCTAAGCTTCAGGTTATAC
CTTGACCCTACCATACCCCCCAAACCAGGACAATTCCAAGCCCAAATCCGTAAAAGAAAC
ACCTAAGGCAATATATAAGATTCTACAGGTCATACATCTAGACTACTTACTAACAATCCG
TAACAACCTCAGAT
>V063O:35:44 length=104
GCTCTTTTTTTTTTTAGCAAAAACCGTTAGCCAATCCCTACCCAACCCCTGGCACCTGGG
GGGGGGTGCCCGAGCGCCGGTGGGAGAACGGAGGAAACGCACTC
The sequences (the strings of data below each ID and length= line) are to be run against the following regular expressions:
#Search sequence for a combination of 2 values of ACGT that are repeated at least 10 times
my $regex1 = qr/( ([ACGT]{2}) \2{9,} )/x;
#Search sequence for a combination of 3 values of ACGT that are repeated at least 7 times
my $regex2 = qr/( ([ACGT]{3}) \2{6,} )/x;
#Search sequence for a combination of 4 values of ACGT that are repeated at least 7 times
my $regex3 = qr/( ([ACGT]{4}) \2{6,} )/x;
for my $regex ($regex1, $regex2, $regex3) {
    next unless $seq1 =~ $regex;
    # $1 is the whole repeated run, $2 is the repeating unit
    printf "Matched %s exactly %d times\n", $2, length($1)/length($2);
    printf "Length of sequence: %d\n", $number;
}
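To make the capture groups concrete, here is a small sketch of how the repeat count falls out of `$1` and `$2`. The helper name `first_repeat` and the test sequence are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns (unit, count)
# for the first tandem repeat matched by $regex, or an empty list if none.
sub first_repeat {
    my ($seq, $regex) = @_;
    return unless $seq =~ $regex;
    # $1 is the whole repeated run, $2 is the repeating unit
    return ($2, length($1) / length($2));
}

# Same pattern as $regex2 above: a 3-base unit repeated at least 7 times
my $regex2 = qr/( ([ACGT]{3}) \2{6,} )/x;

# Made-up sequence: "GCT" repeated 8 times, flanked by other bases
my $seq = 'AA' . ('GCT' x 8) . 'TTA';

my ($unit, $count) = first_repeat($seq, $regex2);
printf "Matched %s exactly %d times\n", $unit, $count;
# prints "Matched GCT exactly 8 times"
```

Note that each regex reports only the first (leftmost) run it finds in a sequence; later runs in the same sequence are not reported by this loop.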
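Currently, this prints the results to the command line for a sample text file that contains only a single sequence.
I need to be able to print the following elements, taken from the input file and from the regex matches above, to a text file, with one output file per sequence (so that every sequence found in the input file gets its own file):
ID (example: V063O:34:49), the repeating element (example: GCT), the number of repeats (example: 13), and the length of the entire sequence.
The constraints are that BioPerl is not an option (the end users are not proficient in Perl, so this has to be as easy as possible for them, with no modules to download) and that the input files are very large (300 MB or more).
What is the best way to handle this?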
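One possible shape for this, as a rough sketch using core Perl only: read the file one line at a time so that only a single record is held in memory, and write a small report file whenever a record ends. Several things here are assumptions rather than requirements from the description above: the sub names `write_report` and `process_fasta`, the tab-separated output layout, the file naming (characters such as `:` in the ID are replaced with `_`), and taking the length from the sequence itself rather than from the `length=` field in the header.

```perl
use strict;
use warnings;

my @regexes = (
    qr/( ([ACGT]{2}) \2{9,} )/x,
    qr/( ([ACGT]{3}) \2{6,} )/x,
    qr/( ([ACGT]{4}) \2{6,} )/x,
);

# Write one report file for a single record (hypothetical layout:
# ID, repeating unit, repeat count, sequence length, tab-separated).
sub write_report {
    my ($id, $seq, $out_dir) = @_;
    return unless defined $id;              # nothing read yet
    (my $name = $id) =~ s/[^\w.-]/_/g;      # ':' is not safe in every file system
    my $path = "$out_dir/$name.txt";
    open my $out, '>', $path or die "Cannot write $path: $!";
    for my $regex (@regexes) {
        next unless $seq =~ $regex;
        print {$out} join("\t", $id, $2, length($1) / length($2), length $seq), "\n";
    }
    close $out or die "Cannot close $path: $!";
}

# Stream a FASTA file handle line by line; only one record is in memory.
sub process_fasta {
    my ($in, $out_dir) = @_;
    my ($id, $seq) = (undef, '');
    while (my $line = <$in>) {
        $line =~ s/\s+\z//;                 # strip newline and trailing spaces
        if ($line =~ /^>(\S+)/) {
            my $new_id = $1;                # save before any further matching
            write_report($id, $seq, $out_dir);
            ($id, $seq) = ($new_id, '');
        }
        else {
            $seq .= $line;                  # sequence may span several lines
        }
    }
    write_report($id, $seq, $out_dir);      # flush the last record
}

# Assumed usage: perl find_repeats.pl input.fasta existing_output_dir
if (@ARGV == 2) {
    my ($file, $out_dir) = @ARGV;
    open my $in, '<', $file or die "Cannot read $file: $!";
    process_fasta($in, $out_dir);
    close $in;
}
```

As written, a sequence with no repeats still gets an (empty) file. With a 300 MB input this could mean a very large number of files in one directory, so skipping records without matches, or writing a single combined report instead, may be worth considering.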