I have a list of genomic positions in the format chromosome:start-end
For example:
chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200
I would like to sort it by chromosome number and then by numeric start position, to get:
chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-200
chrX:100-200
What is a good way to do this in Perl?
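It looks to me like you want to sort by chromosome first (numerically when both names are numbers, as strings otherwise), then by start position, then by end position.
So, perhaps a custom sort like this: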
use strict;
use warnings;
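print sort {
    # After the split the fields are: ('', chromosome, start, end)
    my @a = split /chr|:|-/, $a;
    my @b = split /chr|:|-/, $b;
    # Compare chromosomes numerically if both are all digits, else as strings
    "$a[1]$b[1]" !~ /\D/ ? $a[1] <=> $b[1] : $a[1] cmp $b[1]
        or $a[2] <=> $b[2]
        or $a[3] <=> $b[3]
} <DATA>;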
__DATA__
chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200
chrY:100-200
chrX:1-100
chr10:100-150
Output:
chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-150
chr10:100-200
chrX:1-100
chrX:100-200
chrY:100-200
Just use the Sort::Key::Natural module:
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
print natsort <DATA>;
__DATA__
chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200
chrY:100-200
chrX:1-100
chr10:100-150
Output:
chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-150
chr10:100-200
chrX:1-100
chrX:100-200
chrY:100-200
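You can sort this by providing a custom comparator. It looks like you want a two-level value as the sort key, so your custom comparator derives a key for each row and then compares the keys: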
# You want karyotypical sorting on the first element,
# so set up this hash with an appropriate normalized value
# per available input:
my %karyotypical_sort = (
    1 => 1,
    ...
    X => 100,
);

sub row_to_sortable {
    my $row = shift;
    $row =~ /chr(.+):(\d+)-/; # assuming match here! Be careful
    return [$karyotypical_sort{$1}, $2];
}

sub sortable_compare {
    my ($one, $two) = @_;
    # If the first comparison returns 0 then try the second
    return $one->[0] <=> $two->[0] || $one->[1] <=> $two->[1];
}

@lines = ...
print join "\n", sort {
    sortable_compare(row_to_sortable($a), row_to_sortable($b))
} @lines;
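For reference, here is one way the outline above could be filled in so that it runs. This is only a sketch: the rank values (1 to 22 for the autosomes, then X, Y and M) and the "sort last" fallback for rows that do not parse are my assumptions, not something the question specifies, and `//` needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Assumed ranking: autosomes 1..22 keep their own number,
# X, Y and M follow them. Adjust for your organism and naming.
my %karyotypical_sort = map { ($_ => $_) } 1 .. 22;
@karyotypical_sort{qw(X Y M)} = (23, 24, 25);

# Assumed fallback rank so unknown or malformed rows sort last.
my $LAST = 1e9;

# Turn "chrN:start-end" into [rank, start].
sub row_to_sortable {
    my ($row) = @_;
    my ($chr, $start) = $row =~ /^chr(\w+):(\d+)-/
        or return [ $LAST, 0 ];
    return [ $karyotypical_sort{$chr} // $LAST, $start ];
}

# Compare by rank first, then by start position.
sub sortable_compare {
    my ($one, $two) = @_;
    return $one->[0] <=> $two->[0] || $one->[1] <=> $two->[1];
}

my @lines = qw(
    chr1:1000-1100 chrX:100-200 chr10:100-200 chr2:100-200 chr1:200-300
);
my @sorted = sort {
    sortable_compare(row_to_sortable($a), row_to_sortable($b))
} @lines;
print "$_\n" for @sorted;
```

With this ranking chr2 sorts before chr10, and chrX comes after every numbered chromosome.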
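Since computing the key is somewhat expensive (string operations are not free), and since you are probably dealing with a lot of data (genomes!), you will likely notice better performance if you perform a Schwartzian Transform. It works by precomputing the sort key for each row, sorting on that key, and finally stripping the added data: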
my @st_lines = map { [ row_to_sortable($_), $_ ] } @lines;
my @sorted_st_lines = sort { sortable_compare($a->[0], $b->[0]) } @st_lines;
my @sorted_lines = map { $_->[1] } @sorted_st_lines;
Or, combined:
print join "\n",
    map  { $_->[1] }
    sort { sortable_compare($a->[0], $b->[0]) }
    map  { [ row_to_sortable($_), $_ ] } @lines;
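To see the transform run end to end, here is a small self-contained sketch. `sort_key` is a hypothetical stand-in for the key function (numbered chromosomes keep their number, lettered ones are pushed after them by character code), and the `$calls` counter is only there to show that each key is computed once per line rather than once per comparison.

```perl
use strict;
use warnings;

my $calls = 0;    # counts how often a key is computed

# Hypothetical key function: returns [chromosome rank, start].
sub sort_key {
    my ($row) = @_;
    $calls++;
    my ($chr, $start) = $row =~ /^chr(\w+):(\d+)-/
        or die "Unparsable row: $row";
    # Assumption: lettered chromosomes (X, Y, ...) go after numbered ones.
    my $rank = $chr =~ /^\d+$/ ? $chr : 1000 + ord($chr);
    return [ $rank, $start ];
}

my @lines = qw(
    chr1:100-110 chr1:1000-1100 chr1:200-300 chr10:100-200
    chr2:100-200 chrX:100-200 chrX:1-100
);

my @sorted_lines =
    map  { $_->[1] }                      # 3. strip the keys again
    sort { $a->[0][0] <=> $b->[0][0]      # 2. compare the precomputed keys
        || $a->[0][1] <=> $b->[0][1] }
    map  { [ sort_key($_), $_ ] }         # 1. compute each key exactly once
    @lines;

print "$_\n" for @sorted_lines;
print "keys computed: $calls\n";
```

The counter ends up equal to the number of input lines, whereas calling the key function inside the sort block would run it twice per comparison.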
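You can do something similar with the following script, which takes a text file containing your input above. The sorting of the chromosome numbers needs a slight change, since it is neither purely lexical nor purely numeric. But I'm sure you can adapt what I have below: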
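use strict;
use warnings;
my %chromosomes;
# Index each line by chromosome name and start position.
# Note: two lines with the same chromosome and start overwrite each other.
while (<>) {
    if (/^chr(\w+):(\d+)-\d+$/) {
        my $chr_num   = $1;
        my $chr_start = $2;
        $chromosomes{$chr_num}{$chr_start} = $_;
    }
}
# Plain sort is lexical, so chr10 comes out before chr2; adjust as needed.
my @chr_nums = sort keys %chromosomes;
foreach my $chr_num (@chr_nums) {
    my @chr_starts = sort { $a <=> $b } keys %{ $chromosomes{$chr_num} };
    foreach my $chr_start (@chr_starts) {
        print $chromosomes{$chr_num}{$chr_start};
    }
}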
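One possible way to make that adjustment is sketched below. `by_chromosome` is a hypothetical helper that puts numbered chromosomes first in numeric order and named ones (X, Y, ...) after them in string order. I have also assumed it is worth keeping a list per start position, so that two lines sharing a chromosome and start do not overwrite each other. The input is hard-coded here instead of read from a file.

```perl
use strict;
use warnings;

# Hypothetical comparator: numbered chromosomes first (numeric order),
# then named ones in string order.
sub by_chromosome {
    my ($x, $y) = @_;
    my ($x_num, $y_num) = map { $_ =~ /^\d+$/ ? 1 : 0 } ($x, $y);
    return $x <=> $y if $x_num && $y_num;    # both numbered
    return $y_num <=> $x_num                 # numbered before named
        || $x cmp $y;                        # both named
}

my @input = qw(
    chr1:100-110 chr1:1000-1100 chr1:200-300 chr10:100-200 chr2:100-200
    chrX:100-200 chrY:100-200 chrX:1-100 chr10:100-150
);

my %chromosomes;
for my $line (@input) {
    next unless $line =~ /^chr(\w+):(\d+)-\d+$/;
    # A list per start keeps lines that share a start position.
    push @{ $chromosomes{$1}{$2} }, $line;
}

my @sorted;
for my $chr (sort { by_chromosome($a, $b) } keys %chromosomes) {
    for my $start (sort { $a <=> $b } keys %{ $chromosomes{$chr} }) {
        push @sorted, @{ $chromosomes{$chr}{$start} };
    }
}
print "$_\n" for @sorted;
```

Lines with the same chromosome and start stay in input order in this sketch; sorting each bucket by end position would be a further step.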