请帮我解析一个VCF文件。我正在粘贴一个真实的例子。
输入:
1 1014143 rs786201005 C T . . RS=786201005;RSPOS=1014143;dbSNPBuildID=144;SSR=0;SAO=1;VP=0x050068000605000002110100;GENEINFO=ISG15:9636;WGT=1;VC=SNV;PM;PMC;NSN;REF;ASP;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.11:g.1014143C>T;CLNSRC=OMIM_Allelic_Variant;CLNORIGIN=1;CLNSRCID=147571.0003;CLNSIG=5;CLNDSDB=MedGen:OMIM:Orphanet;CLNDSDBID=C4015293:616126:ORPHA319563;CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNREVSTAT=no_criteria;CLNACC=RCV000162196.3
1 1014228 rs1921 G A,C . . RS=1921;RSPOS=1014228;dbSNPBuildID=36;SSR=0;SAO=0;VP=0x050328000a0517053f000100;GENEINFO=ISG15:9636;WGT=1;VC=SNV;PM;PMC;S3D;SLO;NSM;REF;ASP;VLD;G5A;G5;HD;GNO;KGPhase1;KGPhase3;CLNALLE=1;CLNHGVS=NC_000001.11:g.1014228G>A;CLNSRC=.;CLNORIGIN=1;CLNSRCID=.;CLNSIG=2;CLNDSDB=MedGen;CLNDSDBID=CN169374;CLNDBN=not_specified;CLNREVSTAT=single;CLNACC=RCV000455759.1;CAF=0.6611,0.3389,.;COMMON=1
1 1014316 rs672601345 C CG . . RS=672601345;RSPOS=1014319;dbSNPBuildID=142;SSR=0;SAO=1;VP=0x050068001205000002110200;GENEINFO=ISG15:9636;WGT=1;VC=DIV;PM;PMC;NSF;REF;ASP;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.11:g.1014319dupG;CLNSRC=OMIM_Allelic_Variant;CLNORIGIN=1;CLNSRCID=147571.0002;CLNSIG=5;CLNDSDB=MedGen:OMIM:Orphanet;CLNDSDBID=C4015293:616126:ORPHA319563;CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNREVSTAT=no_criteria;CLNACC=RCV000148989.5
1 1014359 rs672601312 G T . . RS=672601312;RSPOS=1014359;dbSNPBuildID=142;SSR=0;SAO=1;VP=0x050068000605000002110100;GENEINFO=ISG15:9636;WGT=1;VC=SNV;PM;PMC;NSN;REF;ASP;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.11:g.1014359G>T;CLNSRC=OMIM_Allelic_Variant;CLNORIGIN=1;CLNSRCID=147571.0001;CLNSIG=5;CLNDSDB=MedGen:OMIM:Orphanet;CLNDSDBID=C4015293:616126:ORPHA319563;CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNREVSTAT=no_criteria;CLNACC=RCV000148988.5
1 1020183 rs539283387 G C . . RS=539283387;RSPOS=1020183;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000a05040026000100;GENEINFO=AGRN:375790;WGT=1;VC=SNV;NSM;REF;ASP;VLD;KGPhase3;CLNALLE=1;CLNHGVS=NC_000001.11:g.1020183G>C;CLNSRC=.;CLNORIGIN=1;CLNSRCID=.;CLNSIG=3;CLNDSDB=MedGen;CLNDSDBID=CN169374;CLNDBN=not_specified;CLNREVSTAT=single;CLNACC=RCV000424799.1;CAF=0.9904,0.009585;COMMON=1
1 1020216 rs764659938 C G . . RS=764659938;RSPOS=1020216;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000a05040002000100;GENEINFO=AGRN:375790;WGT=1;VC=SNV;NSM;REF;ASP;VLD;CLNALLE=1;CLNHGVS=NC_000001.11:g.1020216C>G;CLNSRC=.;CLNORIGIN=1;CLNSRCID=.;CLNSIG=0;CLNDSDB=MedGen;CLNDSDBID=CN221809;CLNDBN=cancer;CLNREVSTAT=single;CLNACC=RCV000422793.1
我需要一个输出:
1014143 rs786201005 C T CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1014228 rs1921 G A,C CLNSIG=2 CLNDBN=not_specified
1014316 rs672601345 C CG CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1014359 rs672601312 G T CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1020183 rs539283387 G C CLNSIG=3 CLNDBN=not_specified
1020216 rs764659938 C G CLNSIG=0 CLNDBN=not_provided
这意味着打印第2,3,4,5列,然后解析最后一列并仅打印CLNSIG和CLNDBN。问题是,这些值并不总是处于相同的位置。
我的尝试是:
awk -v OFS="\t"'{print $2,$3,$4,$5,$8}' input
...然后我不知道如何获得CLNSIG和CLNDBN。
谢谢你的任何想法。