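A collaborator gave me some DNA sequences in a Word document, and I want to convert them into a series of FASTA sequences in a single file.
I have already saved the document as a text file, and I want to use regular expressions to extract the gene names and sequences: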
    use warnings;
    use strict;

    die "usage: make_fasta.pl <sequence file>" unless (@ARGV == 1);

    my $seq_filename  = shift;
    my $fasta_db_name = $seq_filename . "_db.fa";

    open(my $seq_file, '<', $seq_filename)
        or die "can't open file $seq_filename, $!";
    open(my $fasta_file, '>', $fasta_db_name)
        or die "can't open file $fasta_db_name, $!";

    while (my $line = <$seq_file>) {
        chomp $line;
        if ($line =~ /^[ATCG]+$/) { # if the line is entirely DNA sequence
            print $fasta_file "$line\n";
        } elsif ($line =~ /Full-length (\w+) cDNA/) { # if the line has gene info
            print $fasta_file ">$1\n";
        } else {
            next;
        }
    }
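But this only gives me the name of the first gene. Clearly I am doing something wrong with the DNA regex, but I can't for the life of me work out what. As far as I can tell, it is identical to the other DNA tests suggested on this site and elsewhere.
The file I am trying to parse is laid out as follows: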
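One thing that may be worth ruling out, offered as an assumption rather than a diagnosis: text saved from Word can end lines with CRLF or with a bare CR instead of LF. With bare CR and the default input record separator, Perl reads the whole file as a single line, so the anchored DNA regex never matches while the header regex matches once. The sketch below counts each style of line ending in a string; `count_line_endings` and the sample text are hypothetical and only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count each style of line ending in a string.
sub count_line_endings {
    my ($text) = @_;
    my %count = (crlf => 0, cr => 0, lf => 0);
    $count{crlf}++ while $text =~ /\r\n/g;       # Windows style
    $count{cr}++   while $text =~ /\r(?!\n)/g;   # bare CR (old Mac style)
    $count{lf}++   while $text =~ /(?<!\r)\n/g;  # Unix style
    return %count;
}

# Made-up sample that uses bare CR line endings.
my $sample = "Full-length clock cDNA coding sequence\rATGGTAGG\r";
my %c = count_line_endings($sample);
print "CRLF=$c{crlf} CR=$c{cr} LF=$c{lf}\n";   # prints CRLF=0 CR=2 LF=0
```

If the real file reports any CR or CRLF endings, the `^[ATCG]+$` test would fail for reasons unrelated to the character class itself.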
Collaborators name
title of gene set
Full-length clock cDNA coding sequence
ATGGTAGGATGTGTAATGCGTACGTGATCGT
Full-length per cDNA coding sequence
ATGCTAGCTACGTACGTAGCTACGTAGTACG
I would like the output to be a FASTA file, like so:
>clock
ATGGTAGGATGTGTAATGCGTACGTGATCGT
>per
ATGCTAGCTACGTACGTAGCTACGTAGTACG
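The mapping from that input layout to this output can be sketched on an in-memory string, which keeps the two regexes separate from any file-reading issues. This is a minimal sketch, not a definitive fix: `text_to_fasta` is a hypothetical helper, and splitting on CRLF, bare CR, or LF is an assumption about what the exported text might contain.

```perl
use strict;
use warnings;

# Hypothetical helper: build FASTA text from the raw exported text.
sub text_to_fasta {
    my ($text) = @_;
    my $fasta = '';
    # Split on CRLF, bare CR, or LF so any line-ending style is handled.
    for my $line (split /\r\n|\r|\n/, $text) {
        $line =~ s/^\s+|\s+$//g;                       # trim stray whitespace
        if ($line =~ /^[ATCG]+$/) {                    # line is entirely DNA
            $fasta .= "$line\n";
        } elsif ($line =~ /Full-length (\w+) cDNA/) {  # line has gene info
            $fasta .= ">$1\n";
        }
    }
    return $fasta;
}

# The example layout from above, joined with bare CR on purpose.
my $text = "Collaborators name\rtitle of gene set\r"
         . "Full-length clock cDNA coding sequence\r"
         . "ATGGTAGGATGTGTAATGCGTACGTGATCGT\r"
         . "Full-length per cDNA coding sequence\r"
         . "ATGCTAGCTACGTACGTAGCTACGTAGTACG\r";
print text_to_fasta($text);
```

To apply this to a file, the contents would need to be read in one go, for example by localising `$/` to `undef` before reading from the filehandle.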
The first few lines of the actual input file are:
Dr Lin Zhang (Leicester University 10/2012)
Canonical clock genes
Full-length per cDNA coding seq (3693bp)
ATGGACACAGGAACACCCCATGAAGATGTGCCCTCAGAGGACCACACCTTGGAAGAAGGGGACAGCAAGAACCCCTCGTGCCAGCAAGAGTCAGCCTACGGCTCCCTCGAGTCATCCTCCAATGGACAGTCTCAGAAAAGTTTCGGAGGAAGTGGAAGCAAAAGCTTAAATAGTGGTTCGAGTCACAGCAGCGGCTTTGGGGACCAAAATGATTTCAAGGGTATCCATCTTCACGAAGCGAAACACATAGCGTTGAAGAAGAAGAAAACTGGGAAAGGAGGTGAAAAGGTAGCAGAAATCCCCTTTCAAACTGCCTCTGAGGCAGAACTGTCCTCCAAAGGAAACGAAACAGAAAAGGAGAAAGAAACAAGCCTCGAGGAGTCTCCTGCTGCAAAAGAGGAAGCAATTATCGAAAAGGAGTCTCGTTACATCCACCCGAGGAACT