我有两个文件:
- file_1 有三列(标记(SNP)、染色体和位置)
- file_2 有三列(Chromosome、peak_start 和 peak_end)。
除 SNP 列外,所有列都是数字。
这些文件的排列方式如屏幕截图所示。file_1 有数百个 SNP 作为行,而 file_2 有 61 个峰。每个峰值由 peak_start 和 peak_end 标记。任何一个文件中都可以有 23 条染色体中的任何一条,并且 file_2 每个染色体都有几个峰。
我想查找每个匹配染色体的 file_1 中 SNP 的位置是否落在 file_2 中的 peak_start 和 peak_end 内。如果是这样,我想显示哪个 SNP 落在哪个峰值中(最好将输出写入制表符分隔的文件)。
我更喜欢拆分文件,并在染色体是关键的地方使用散列。我发现只有几个与此类似的问题,但我无法很好地理解建议的解决方案。
这是我的代码示例。它只是为了说明我的问题,到目前为止还没有做任何事情,所以将其视为“伪代码”。
#!usr/bin/perl
use strict;
use warnings;
my (%peaks, %X81_05);
my @array;
# Open file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
die "Could not open X81_05.txt";
}
# Split the tab-delimited file into respective fields
while (<FIRST_SAMPLE>) {
chomp $_;
next if (m/Chromosome/); # Skip the header
@array = split("\t", $_);
($chr1, $pos, $sample) = @array;
$X81_05{'$array[0]'} = (
'position' =>'$array[1]'
)
}
close (FIRST_SAMPLE);
# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
chomp $_;
next if (m/Chromosome/); # Skip header
($chr, $peak_start, $peak_end) = split(/\t/);
$peaks{$chr}{'peak_start'} = $peak_start;
$peaks{$chr}{'peak_end'} = $peak_end;
}
close (PEAKS);
for my $chr1 (keys %X81_05) {
my $val = $X81_05{$chr1}{'position'};
for my $chr (keys %peaks) {
my $min = $peaks{$chr}{'peak_start'};
my $max = $peaks{$chr}{'peak_end'};
if (($val > $min) and ($val < $max)) {
#print $val, " ", "lies between"," ", $min, " ", "and", " ", $max, "\n";
}
else {
#print $val, " ", "does not lie between"," ", $min, " ", "and", " ", $max, "\n";
}
}
}
更棒的代码: