-1

输入 1

1 10611 2 122 C:0.983607 G:0.0163934

输入 2

1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08 ;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07

这里第 1 列和第 2 列是匹配的,第一个文件的第 5 列和第 2 个文件的第 4 列的“:”之前的值是 equel,第二个文件的第一个和第 5 列的第 6 列(“:”之前的值)是 equel,输出是基于此匹配创建。将从输入和输出行得到清晰的想法,并且两个文件都是 .gz 文件

输出

1 10611 rs146752890 CG 100 通过 AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08; ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07; REF=0.983607;ALT=0.0163934;

4

2 回答 2

1

这是一种使用方法awk

awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' input1 input2

结果:

1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;

因此,对于压缩文件,请尝试:

awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' <(gzip -dc input1.gz) <(gzip -dc input2.gz) | gzip > output.gz

编辑:

从下面的评论中,试试这个:

awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $1, $2, $3, $4, $5, $6, $7, c[$1,$2,$4,$5] $8 ";" }' file1 file2

结果:

1 10611 rs146752890 C G 100 PASS REF=0.983607;ALT=0.0163934;AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;
于 2012-12-06T12:13:52.013 回答
0

这应该可以工作(假设您有足够的磁盘空间来存储扩展的 .gz 文件):

zcat 1 | awk '{print $1$2,$0}' | sort > new1
zcat 2 | awk '{print $1$2,$0}' | sort > new2
join new1 new2 -11 -21 -o "2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 1.6 1.7"|sed 's/ C:/;REF=/'|sed 's/ G:/;ALT=/' > output
于 2012-12-06T09:46:09.637 回答