I have some data in two text files, for example:

file1.txt:

contig postion      majorallele minorallele highqualty reliable defin highqualty 
Contig1         479 *   C   0   0   0   0
Contig1         617 T   A   0   0   0   0
Contig15    243 T   C   0   0   0   0
Contig15    471 T   C   0   0   0   0

file2.txt:

contig 1 chromosome 0 000000476-044111330
contig 1 chromosome 0 000000477-044111331
contig 1 chromosome 0 000000478-044111332
contig 1 chromosome 0 000000479-044111333
contig 1 chromosome 0 000000480-044111334
contig 1 chromosome 0 000000481-044111335
contig 1 chromosome 0 000000482-044111336
contig 15 chromosome 3 000000242-018378247
contig 15 chromosome 3 000000243-018378248
contig 15 chromosome 3 000000244-018378249
contig 15 chromosome 3 000000245-018378250
contig 15 chromosome 3 000000468-018377016
contig 15 chromosome 3 000000469-018377017
contig 15 chromosome 3 000000470-018377018
contig 15 chromosome 3 000000471-018377019
contig 15 chromosome 3 000000472-018377020
contig 15 chromosome 3 000000473-018377021

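What I want to do is compare the first two columns of file1.txt (contig and position) with the contig in file2.txt (its first two columns) and the position before the hyphen in its fifth column, and return output like this: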

contig 1 chromosome 0 000000479-044111333 * C   0   0   0   0
contig 15 chromosome 3 000000243-018378248 T    C   0   0   0   0
contig 15 chromosome 3 000000471-018377019 T    C   0   0   0   0

That is, the matching lines from the two files, merged in the output.

1 Answer

You can do this simply with awk rather than Perl.

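awk 'FNR==NR && NR!=1{      # file1.txt, skipping its header line
x=tolower($1);              # "Contig1" -> "contig1"
y=$2;                       # position
$1=$2="";                   # blank the key columns, keep the rest of the line
a[x""y]=$0;
next
}{
b=$5;
gsub(/^0*/,"",b);           # strip leading zeros: 000000479-... -> 479-...
split(b,c,"-");             # c[1] is the position before the hyphen
if($1$2c[1] in a)print $0,a[$1$2c[1]]}' file1.txt file2.txt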

Tested as follows:

> cat temp1
contig postion      majorallele minorallele highqualty reliable defin highqualty 
Contig1         479 *   C   0   0   0   0
Contig1         617 T   A   0   0   0   0
Contig15    243 T   C   0   0   0   0
Contig15    471 T   C   0   0   0   0
>
> cat temp2
contig 1 chromosome 0 000000476-044111330
contig 1 chromosome 0 000000477-044111331
contig 1 chromosome 0 000000478-044111332
contig 1 chromosome 0 000000479-044111333
contig 1 chromosome 0 000000480-044111334
contig 1 chromosome 0 000000481-044111335
contig 1 chromosome 0 000000482-044111336
contig 15 chromosome 3 000000242-018378247
contig 15 chromosome 3 000000243-018378248
contig 15 chromosome 3 000000244-018378249
contig 15 chromosome 3 000000245-018378250
contig 15 chromosome 3 000000468-018377016
contig 15 chromosome 3 000000469-018377017
contig 15 chromosome 3 000000470-018377018
contig 15 chromosome 3 000000471-018377019
contig 15 chromosome 3 000000472-018377020
contig 15 chromosome 3 000000473-018377021
>
> nawk 'FNR==NR && NR!=1{x=tolower($1);y=$2;$1=$2="";a[x""y]=$0;next}{b=$5;gsub(/^0*/,"",b);split(b,c,"-");if($1$2c[1] in a)print $0,a[$1$2c[1]]}' temp1 temp2
contig 1 chromosome 0 000000479-044111333   * C 0 0 0 0
contig 15 chromosome 3 000000243-018378248   T C 0 0 0 0
contig 15 chromosome 3 000000471-018377019   T C 0 0 0 0
> 
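
One caveat with the plain concatenated key: it glues contig and position together with no separator, so Contig1 at position 479 and a contig 14 at position 79 would both become "contig1479". The following is a minimal sketch of a variant that joins the two key parts with awk's built-in SUBSEP instead, and also trims the leading blanks left after clearing $1 and $2. The file names snps.txt and map.txt, the function name join_snps, and the contig 14 row are hypothetical, made up only to show the collision; this is not the original answer's code.

```shell
# Small hypothetical samples shaped like file1.txt and file2.txt.
cat > snps.txt <<'EOF'
contig postion majorallele minorallele highqualty reliable defin highqualty
Contig1 479 * C 0 0 0 0
Contig1 617 T A 0 0 0 0
Contig15 243 T C 0 0 0 0
EOF

cat > map.txt <<'EOF'
contig 1 chromosome 0 000000479-044111333
contig 14 chromosome 2 000000079-000000001
contig 15 chromosome 3 000000243-018378248
EOF

# join_snps SNP_FILE MAP_FILE: print each MAP_FILE line that has a
# matching (contig, position) in SNP_FILE, followed by the allele columns.
join_snps() {
  awk '
    FNR == NR {                       # first file: build the lookup table
      if (FNR > 1) {                  # skip the header line
        key = tolower($1) SUBSEP ($2 + 0)
        $1 = $2 = ""                  # drop the key columns
        sub(/^ +/, "")                # trim the blanks they leave behind
        rest[key] = $0
      }
      next
    }
    {                                 # second file: look each line up
      split($5, pos, "-")             # pos[1] is the zero-padded position
      key = $1 $2 SUBSEP (pos[1] + 0) # adding 0 strips the leading zeros
      if (key in rest) print $0, rest[key]
    }
  ' "$1" "$2"
}

join_snps snps.txt map.txt
```

With the sample data above, the contig 14 line is not reported, whereas the unseparated key would have paired it with the Contig1/479 record.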
answered 2013-08-02T12:17:57.603