loops - Running a loop over multiple files with bcftools and awk to subset them by type

Question

Dear Stack Overflow community,

I have 100 .VCF files (a kind of text file). The "ID" column contains different structural variant calls:

MantaINS
MantaINV
MantaDEL
MantaBND
MantaDUP
Canvas:REF
Canvas:GAIN
Canvas:LOSS

(each followed by a series of numbers, e.g. MantaINS:00:13:467, Canvas:Gain:594:31:23, etc.)

The files look like this (but much larger, with thousands of entries per file):

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682  MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
2 14477084 MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
3 9425916 MantaDEL:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 2658945 MantaDUP:5:22:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
6 1325462 MantaINV:3:000:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
6 5783961 CavnasREF:7:943:1453 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
7 9425916 CanvasGAIN:9:323:123 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
8 9425916 CanvasLOSS:2:932:123 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15

Each file is in a separate folder, and I have generated a txt file listing the file paths of all 100 vcfs. That file looks like this (first few entries only):

genomes/by_date/2015-09-03/batch1/patient30/patient30.SV.vcf.gz   
genomes/by_date/2016-03-05/batch1/patient4/patient4.SV.vcf.gz    
genomes/by_date/2018-10-14/batch1/patient16/patient16.SV.vcf.gz   
genomes/by_date/2018-012-28/batch1/patient100/patient100.SV.vcf.gz
genomes/by_date/2018-03-14/batch1/patient1/patient1.SV.vcf.gz

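I want to subset the files by the structural variant type found in the ID column, so that for each input vcf file I get 8 output files, one per ID type. For example, for MantaINS I would want a .txt file containing only the following line from the example above: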

2 14477084 MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL END=14477381 SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12

That is, for each input vcf I want 8 subset files as output.

(e.g. person1.vcf -> person1_MantaINS.txt, person1_MantaDEL.txt, person1_MantaINV.txt, etc.)

On a single VCF file I ran:

for T in   MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas
do
   bcftools view person1.vcf  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}'  > ${T}.txt
done

This works well (except for the Canvas calls, which contain a colon). However, I want to run the same loop over a list of 100 input files.

I tried:

for T in   MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas:REF Canvas: GAIN Canvas:LOSS
    do
       parallel -j6 "bcftools view {}  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}'  > $basename{}.txt" :::: paths_to_files.txt
    done

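This gives me an error message from within parallel, "No such file or directory", for every one of my file types.

I am working on an HPC through a remote terminal.

Your help would be much appreciated.

Many thanks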

score 0 · Accepted Answer

You wrote that this works well for a single VCF file:

for T in   MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas
do
   bcftools view person1.vcf  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}'  > ${T}.txt
done

Then this should work as well:

doit() {
    vcf="$1"
    out="$2"
    T="$3"
    bcftools view "$vcf" |
       awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $0}'  > "$out"
}

for T in   MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas
do
   doit person1.vcf person1_${T}.txt ${T}
done
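One practical benefit of wrapping the pipeline in a function is that the awk program stays inside single quotes, so `$3` and `$0` reach awk untouched. In the double-quoted `parallel` command from the question, the calling shell would expand them first. The sketch below only illustrates that quoting difference; the variable names are made up for the demonstration, and it does not by itself establish what caused the "No such file or directory" message.

```shell
#!/usr/bin/env bash
# Clear the positional parameters so that the shell's own $3 is empty,
# as it would be in a typical interactive session.
set --

# Double quotes: the shell substitutes $3 before awk ever sees the program.
double_quoted="awk '{print $3}'"

# Single quotes: $3 is passed through literally, which is what awk needs.
single_quoted='awk '\''{print $3}'\'''

# Print both strings to compare what awk would actually receive.
printf 'double-quoted: %s\n' "$double_quoted"
printf 'single-quoted: %s\n' "$single_quoted"
```

Comparing the two printed strings shows whether the field reference survived the shell.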

If that works, then this should work too:

export -f doit
parallel doit {1} {1.}_{2}.txt {2} \
:::: list_of_vcf_files \
::: MantaINS MantaINV MantaDEL MantaBND MantaDUP Canvas
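Note that GNU parallel's `{1.}` removes only the last extension, so a path ending in `.SV.vcf.gz` would yield an output name ending in `.SV.vcf_MantaINS.txt`. If names like `patient30_MantaINS.txt` are preferred, the name could be built inside the function instead. A minimal sketch, assuming every input path ends in `.SV.vcf.gz` as in the listing from the question; `out_name` is a hypothetical helper, not part of bcftools or parallel:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build "<path without .SV.vcf.gz>_<type>.txt".
# Assumes the input path ends in ".SV.vcf.gz"; other names pass through unchanged.
out_name() {
    local vcf="$1" type="$2"
    local base="${vcf%.SV.vcf.gz}"      # strip the compound extension
    # Replace any ':' in the type (e.g. Canvas:GAIN) so the file name has no colon.
    printf '%s_%s.txt\n' "$base" "${type//:/_}"
}

out_name genomes/by_date/2015-09-03/batch1/patient30/patient30.SV.vcf.gz MantaINS
```

Inside `doit`, the second argument could then be dropped and replaced by `out="$(out_name "$vcf" "$T")"`.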

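If this is not what you want, please show 3 complete examples of the commands you want to run.

(It is also not clear to me what command you want to run for Canvas:GAIN, so please make that one of the 3 examples.)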
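In the meantime, here is one guess at how the Canvas case could be handled. Splitting the ID on `:` and comparing only the first piece can never match a type such as `Canvas:GAIN`, because the first piece is just `Canvas`. A prefix test avoids that. This is a sketch under the assumption that IDs look like `Canvas:GAIN:594:31:23`, as in the question's description; `filter_by_id_prefix` is a hypothetical name, and it reads plain text on stdin so it can sit after `bcftools view`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep lines whose third column (ID) starts with "<type>:".
# Works for both single-part types (MantaINS) and two-part types (Canvas:GAIN).
filter_by_id_prefix() {
    awk -v T="$1" 'index($3, T ":") == 1'
}

# Small self-contained demonstration on made-up records.
printf '%s\n' \
    '1 10 Canvas:GAIN:594:31:23 x' \
    '1 20 Canvas:LOSS:1:2:3 y' \
    '2 30 MantaINS:88:22:00:3 z' |
    filter_by_id_prefix Canvas:GAIN
```

In `doit`, the awk line could be swapped for `filter_by_id_prefix "$T"`; the helper would then also need `export -f` before it is used with parallel.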
