arrays - How to search through array elements for a match in hash keys

Question

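I have an array that contains unique IDs (numbers) for DNA sequences. I have put my DNA sequences in a hash, so that each key contains a descriptive header and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:

Unique ID: 14272

Header (hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272

Sequence (hash value): ATGGGTC...

I want to loop through each unique ID and check whether it matches the number at the end of each header (hash key). If it does, I want to print the hash key and value to a file. So far I have this: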

my %hash; 
@hash{@hash_index} = @hash_seq;

foreach $hash_index (sort keys %hash) {
        for ($i=0; $i <= $#scaffoldnames; $i++) {
            if ($hash_index =~ /$scaffoldnames[$i]/) {
                print GENE_ID "$hash_index\n$hash{$hash_index}\n";
        }
    }
}
close(GENE_ID);

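So the unique IDs are held in @scaffoldnames.

This doesn't work! I'm not sure how best to loop through both the hash and the array to find matches.

I'll expand below:

Upstream code: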

foreach(@scaffoldnames) {
     s/[^0-9]*//g;
} #Remove all non-numerics

my @genes = read_file('splice.txt'); #Splice.txt is a fasta file

my $hash_index = '';
my $hash_seq = '';
foreach(@genes){
    if (/^>/){
        my $head = $_;
        $hash_index .= $head; #Collect all heads for hash
    }
        else {
            my $sequence = $_;
            $hash_seq .= $sequence; #Collect all sequences for hash
        }
}

my @hash_index = split(/\n/,$hash_index); #element[0]=head1, element[1]=head2
my @hash_seq = split(/\n/, $hash_seq); #element[0]=seq1, element[1]=seq2

my %hash; # Make hash from both arrays - heads as keys, seqs as values
@hash{@hash_index} = @hash_seq;

foreach $hash_index (sort keys %hash) {
        for ($i=0; $i <= $#scaffoldnames; $i++) {
            if ($hash_index =~ /$scaffoldnames[$i]$/) {
                print GENE_ID "$hash_index\n$hash{$hash_index}\n";
        }
    }
}
close(GENE_ID);

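I'm trying to isolate all the differentially expressed genes (by unique ID) from cuffdiff (RNA-Seq) output and relate them to the scaffolds (in this case, expressed sequences) they came from.

So I'm hoping I can take each unique ID and search the original fasta file to pull out the header it matches along with its associated sequence.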

score 4 · Accepted Answer

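You seem to have missed the point of hashes: they are used to index your data by key, so that you can reach the relevant information in one step, just as you can with an array index. Looping over every hash element rather defeats that purpose. For example, you wouldn't write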

my $value;

for my $i (0 .. $#data) {
  $value = $data[$i] if $i == 5;
}

you would simply do

my $value = $data[5];
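
The same holds for a hash: a direct lookup by key replaces a scan over every key. A minimal sketch, where %seq_for and its contents are made up purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical data: trailing ID => sequence
my %seq_for = (14272 => 'ATGGGTC', 99 => 'TTGACA');

# Scanning every key to find one entry: works, but wasteful
my $slow;
for my $key (keys %seq_for) {
    $slow = $seq_for{$key} if $key == 14272;
}

# Direct lookup by key: a single step
my $fast = $seq_for{14272};

print "same result\n" if $slow eq $fast;
```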

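Without more information about where your data comes from and exactly what you want, it is hard to help properly, but this code should get you started.

I've used one-element arrays that I think look like the ones you are working with, and built a hash keyed on the ID (the trailing digits of the header) whose values hold the header and sequence as a two-element array. You can look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1].

If you give more of an indication of your circumstances then we can help you further.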

use strict;
use warnings;

my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq = ('ATGGGTC...');

my @scaffoldnames = (14272);

my %hash = map {
  my ($key) = $hash_index[$_] =~ /(\d+)\z/;
  $key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;

open my $gene_fh, '>', 'gene_id.txt' or die $!;

for my $name (@scaffoldnames) {
  next unless my $info = $hash{$name};
  printf $gene_fh "%s\n%s\n", @$info;
}

close $gene_fh;
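
Individual records can then be fetched directly by ID. A small sketch that rebuilds the same hash from the sample data and uses exists to guard against IDs that have no matching header (the ID 99999 is made up for the example):

```perl
use strict;
use warnings;

my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq   = ('ATGGGTC...');

# Key each [header, sequence] pair on the header's trailing digits
my %hash = map {
    my ($key) = $hash_index[$_] =~ /(\d+)\z/;
    $key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;

for my $id (14272, 99999) {    # 99999 is a made-up ID with no header
    if (exists $hash{$id}) {
        my ($header, $sequence) = @{ $hash{$id} };
        print "$id: $header => $sequence\n";
    }
    else {
        print "$id: no matching header\n";
    }
}
```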

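Update

From the new code you have posted, it looks like you can replace that section with this code.

It works by taking the trailing digits from every sequence header it finds and using them as the key that selects the hash element to append the data to. Each hash value is the header and sequence together in a single string. If you have a reason to keep them separate, let me know.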

foreach (@scaffoldnames) {
    s/\D+//g;
}    # Remove all non-numerics

open my $splice_fh, '<', 'splice.txt' or die $!;    # splice.txt is a FASTA file

my %sequences;

my $id;
while (<$splice_fh>) {
    ($id) = /(\d+)$/ if /^>/;
    $sequences{$id} .= $_ if $id;
}

for my $id (@scaffoldnames) {
    if (my $sequence = $sequences{$id}) {
        print GENE_ID $sequence;
    }
}
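
To try the approach without the real files, here is a self-contained sketch. The FASTA text, the scaffold names, and the in-memory filehandles are stand-ins invented for illustration, and the output goes to a string rather than to GENE_ID:

```perl
use strict;
use warnings;

# Made-up FASTA data standing in for splice.txt
my $fasta = <<'END';
>geneA_mRNA14272
ATGGGTC
CCGTTA
>geneB_mRNA303
TTGACA
END

my @scaffoldnames = ('scaffold14272', 'scaffold555');    # hypothetical names
s/\D+//g for @scaffoldnames;                             # keep digits only

open my $in, '<', \$fasta or die $!;

my (%sequences, $id);
while (<$in>) {
    ($id) = /(\d+)$/ if /^>/;    # new header: take its trailing digits
    $sequences{$id} .= $_ if defined $id;
}
close $in;

# Write the matching records to an in-memory "file"
my $output = '';
open my $out, '>', \$output or die $!;
for my $wanted (@scaffoldnames) {
    print $out $sequences{$wanted} if exists $sequences{$wanted};
}
close $out;

print $output;
```

Because multi-line sequences are appended to the same hash element, each record comes back exactly as it appeared in the input, header line included.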
