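I have a list of genes with the following information:
- Their names, 'XLOC_0000...'
- The genomic scaffold they are located on, 'Scaffold...'
- The position of each feature on its scaffold ('start', 'stop')
I have written a piece of Perl code that finds each gene in the genomic scaffolds and saves it to a file. Briefly, I first put each gene into a hash of arrays, e.g.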
my %geneID = map { $xloc[$_] => [ $scaffold[$_], $start[$_], $stop[$_] ] } (0 .. $#xloc);
Then I read the fasta file containing the scaffolds into a hash:
open FASTA, '<', 'genome.fasta' or die "Can't open 'genome.fasta': $!\n"; # Read in the fasta file ('or' binds looser than '||', so the die can actually fire)
my (@head, @sequence);
while (<FASTA>) {
chomp;
push @head, $_ if /^>/;
push @sequence, $_ if /^[A-Z]/;
}
my %scaf;
@scaf{@head} = @sequence; # Map each header to its sequence (assumes one sequence line per header).
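Pairing @head with @sequence by position only holds when every record's sequence sits on a single line. As a minimal sketch, assuming the sequences might be wrapped over several lines, each sequence could instead be keyed by its header while reading. The helper name read_fasta and the in-memory $fasta_text are made up for illustration; headers keep their leading '>' so the keys match the ones used elsewhere in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle into a hash
# keyed by header line (leading '>' kept, as in %scaf above).
sub read_fasta {
    my ($fh) = @_;
    my (%seq, $current);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        if ($line =~ /^>/) {
            $current = $line;          # start a new record
            $seq{$current} = '';
        }
        elsif (defined $current) {
            $seq{$current} .= $line;   # append wrapped sequence lines
        }
    }
    return %seq;
}

# Illustrative data only; a real run would open 'genome.fasta' instead.
my $fasta_text = ">Scaffold1\nONEATCGCGC\nTTAGTGCAGT\n>Scaffold2\nTWOATCGCGC\n";
open my $fh, '<', \$fasta_text or die "Can't open in-memory FASTA: $!";
my %scaf = read_fasta($fh);
close $fh;

print "$_ => $scaf{$_}\n" for sort keys %scaf;
```

With single-line records this should give the same hash as the two-array version, so it can stand in for it without changing the rest of the script.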
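Then I take the elements of the first HoA and, using substr, extract each gene from its start to its stop position in the scaffold of the same name: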
foreach my $xloc (sort keys %geneID) {
print "gene sequence for $xloc is: ";
my $chm = $geneID{$xloc}[0];   # scaffold header
my $start = $geneID{$xloc}[1]; # 1-based start
my $end = $geneID{$xloc}[2];   # 1-based, inclusive stop
my $seq = substr($scaf{$chm},$start-1,$end-($start-1));
print "$seq\n";
}
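The substr arguments in the loop convert a 1-based, inclusive start/stop pair into a 0-based offset and a length. A small sketch of that arithmetic on its own, using a made-up toy string and a hypothetical helper named slice_1based:

```perl
use strict;
use warnings;

# Hypothetical helper: extract a 1-based, inclusive [start, stop] range.
sub slice_1based {
    my ($seq, $start, $stop) = @_;
    my $offset = $start - 1;          # substr offsets are 0-based
    my $length = $stop - $start + 1;  # both ends are included
    return substr($seq, $offset, $length);
}

my $toy   = 'ABCDEFGHIJ';             # made-up sequence, not real data
my $piece = slice_1based($toy, 3, 6);
print "$piece\n";                     # prints "CDEF"
```

The length here, $stop - $start + 1, is the same quantity as $end-($start-1) in the loop, just written out in two steps.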
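The problem is that when several entries share the same xloc name, e.g. XLOC_00001, the hash key keeps only the last value. I would like to be able to add multiple 'sub-values' to each hash key, use substr to extract each of them, and join them together at the end.
Any suggestions on how to do this?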
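To make the goal concrete, here is a rough sketch of the kind of structure I have in mind, not a tested solution: each key would hold a list of [scaffold, start, stop] records (a hash of arrays of arrays) that is filled with push, so repeated names accumulate instead of overwriting each other. The input arrays, the short scaffold sequences, and the %gene_seq hash are made up for illustration.

```perl
use strict;
use warnings;

# Made-up input: XLOC_00001 appears twice, on different scaffolds.
my @xloc     = ('XLOC_00001', 'XLOC_00002', 'XLOC_00001');
my @scaffold = ('>Scaffold1', '>Scaffold2', '>Scaffold3');
my @start    = (1, 1, 2);
my @stop     = (3, 3, 5);

# Made-up, shortened scaffold sequences keyed by header.
my %scaf = (
    '>Scaffold1' => 'ONEATCGCGC',
    '>Scaffold2' => 'TWOATCGCGC',
    '>Scaffold3' => 'THREEATCGC',
);

# Each key holds a list of [scaffold, start, stop] records, so repeated
# names accumulate instead of overwriting one another.
my %geneID;
for my $i (0 .. $#xloc) {
    push @{ $geneID{ $xloc[$i] } }, [ $scaffold[$i], $start[$i], $stop[$i] ];
}

my %gene_seq;
for my $xloc (sort keys %geneID) {
    my @pieces;
    for my $rec (@{ $geneID{$xloc} }) {
        my ($chm, $start, $end) = @$rec;
        push @pieces, substr($scaf{$chm}, $start - 1, $end - $start + 1);
    }
    $gene_seq{$xloc} = join '', @pieces;   # pieces joined in input order
    print "gene sequence for $xloc is: $gene_seq{$xloc}\n";
}
```

In this sketch the pieces are joined in the order they were read; if they need to follow their position on the scaffold, the records would presumably have to be sorted by start before joining.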
Update:
Here is a test example showing the results I get:
The 'genome' fasta file:
>Scaffold1
ONEATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold2
TWOATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold3
THREEATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold4
FOURATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold5
FIVEATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold6
SIXATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold7
SEVENATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold8
EIGHTATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold9
NINEATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
>Scaffold10
TENATCGCGCTTAGTGCAGTACGTAGCTACGTGACTACTGA
Keys and values of %geneID:
Key: XLOC_000027 contains the values: >Scaffold1 1 10
Key: XLOC_000037 contains the values: >Scaffold2 1 15
Key: XLOC_000038 contains the values: >Scaffold3 2 9
Key: XLOC_000051 contains the values: >Scaffold4 6 8
Key: XLOC_000077 contains the values: >Scaffold5 2 7
Key: XLOC_000079 contains the values: >Scaffold6 4 16
Key: XLOC_000096 contains the values: >Scaffold7 4 9
Key: XLOC_000100 contains the values: >Scaffold8 3 20
Key: XLOC_000117 contains the values: >Scaffold9 6 8
Key: XLOC_000119 contains the values: >Scaffold10 7 14
The result, showing each 'gene' as a substring of the scaffold its XLOC is on:
gene sequence for XLOC_000027 is: ONEATCGCG
gene sequence for XLOC_000037 is: TWOATCGCGCTTAG
gene sequence for XLOC_000038 is: HREEATCG
gene sequence for XLOC_000051 is: TCGCGCT
gene sequence for XLOC_000077 is: IVEATC
gene sequence for XLOC_000079 is: ATCGCGCTTAGTGCA
gene sequence for XLOC_000096 is: ENATCGCG
gene sequence for XLOC_000100 is: GHTATCGCGCTTAGTGCAG
gene sequence for XLOC_000117 is: TCGCGCT
gene sequence for XLOC_000119 is: GCGCTTAGTGCAG