awk - Create a table from a file with awk

Question

I have a command log file and want to extract some of its information in table format. The input looks like this:

####################################################################################################
# Starting pipeline at Mon Jul 29 12:22:56 CEST 2013
# Input files:  test.fastq
# Output Log:  .bpipe/logs/27790.log
# Stage Results
mkdir ./QC_graphics_results/


####################################################################################################
# Starting pipeline at Mon Jul 29 12:22:57 CEST 2013
# Input files:  test.fastq
# Output Log:  .bpipe/logs/27790.log
# Stage Statistics_graph_2
 fastqc test.fastq -o ./QC_graphics_results/
mv .QC_graphics_results/*fastqc .QC_graphics_results/fastqc


####################################################################################################
# Starting pipeline at Mon Jul 29 12:24:18 CEST 2013
# Input files:  test.fastq
# Output Log:  .bpipe/logs/27790.log
# Stage GC_content [all]
# Stage Dinucleotide_odds [all]
# Stage Sequence_duplication [all]
prinseq-lite.pl -fastq test.fastq -graph_data test.Dinucleotide_odds.gd -graph_stats dn -out_good null -out_bad null 
prinseq-lite.pl -fastq test.fastq -graph_data test.Sequence_duplication.gd -graph_stats da -out_good null -out_bad null
prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats gc -out_good null -out_bad null

The desired output is a table listing each stage with its command, like this:

    Stage result              mkdir./QC_grahics_results/
Stage Statistics_graph_2      fastqc test.fastq -o ./QC_graphics_results/
Stage GC_content [all]        prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats gc -out_good null -out_bad null
Dinucleotide_odds [all]       prinseq-lite.pl -fastq test.fastq -graph_data test.Sequence_duplication.gd -graph_stats da -out_good null -out_bad null
Stage Sequence_duplication [all]      prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats gc -out_good null -out_bad null

I have been trying to do this in awk with the following code, but it does not work. Any suggestions?

 cat commandlog.txt | awk '/^#\ Stage*/{print $0} !/^#.*/{print $0}' | awk '{ if ($0 ~ /^#*/){ if (b=1){next} else {a=$0 b=1 next;} else { if (NF!=0){func=$0 b=0 print $a\t$func\n}}' > ./statistic_files/user_options

score 1 · Accepted Answer

Save this in a file named awk0.

NF == 0 {next}

substr($1,1,1) == "#" && $2 != "Stage" {next}

$2 == "Stage" && NF == 3 {stage_name = $2 " " $3
                                        next }

stage_name != "" {print stage_name, $0
                                        stage_name = ""
                                        next}

$2 == "Stage" {arr[$3] = ""
                                        next}

                                      {
                                        {for (i in arr) {
                                           if (match($0, i) != 0)
                                             print "Stage", i, $0
                                                        };
                                         }
                                       }

Then run: cat commandlog.txt | awk -f awk0 > ./statistic_files/user_options

Output:

Stage Results mkdir ./QC_graphics_results/
Stage Statistics_graph_2  fastqc test.fastq -o ./QC_graphics_results/
Stage Dinucleotide_odds prinseq-lite.pl -fastq test.fastq -graph_data test.Dinucleotide_odds.gd -graph_stats dn -out_good null -out_bad null
Stage Sequence_duplication prinseq-lite.pl -fastq test.fastq -graph_data test.Sequence_duplication.gd -graph_stats da -out_good null -out_bad null
Stage GC_content prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats
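
If a tab-separated, two-column table is preferred, the same rules can be wrapped in a small shell function. This is only a sketch under a few assumptions: the `sample` text is a shortened, hypothetical copy of the log from the question, `make_table` is a name invented here, and `index()` replaces `match()` so that stage names are compared as literal substrings rather than as regular expressions. As with the script above, only the first command after a single-name stage line is reported.

```shell
#!/bin/sh
# Hypothetical sample input: a shortened copy of the log from the question.
sample='####################
# Starting pipeline at Mon Jul 29 12:22:56 CEST 2013
# Stage Results
mkdir ./QC_graphics_results/

####################
# Stage Statistics_graph_2
 fastqc test.fastq -o ./QC_graphics_results/

####################
# Stage GC_content [all]
# Stage Dinucleotide_odds [all]
prinseq-lite.pl -fastq test.fastq -graph_data test.Dinucleotide_odds.gd -graph_stats dn -out_good null -out_bad null
prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats gc -out_good null -out_bad null'

# make_table (assumed name): read a log on stdin, print "Stage <name><TAB><command>".
make_table() {
  awk '
    NF == 0 { next }                                    # skip blank lines
    /^#/ && $2 != "Stage" { next }                      # skip comments that are not stage lines
    $2 == "Stage" && NF == 3 { stage_name = $3; next }  # single stage: remember its name
    stage_name != "" {                                  # first command after a single stage
      sub(/^[ \t]+/, "")                                # trim leading whitespace
      printf "Stage %s\t%s\n", stage_name, $0
      stage_name = ""
      next
    }
    $2 == "Stage" { arr[$3] = ""; next }                # parallel stage ("[all]"): collect its name
    {                                                   # command line: find the stage it mentions
      for (i in arr)
        if (index($0, i) != 0)
          printf "Stage %s\t%s\n", i, $0
    }
  '
}

table=$(printf '%s\n' "$sample" | make_table)
printf '%s\n' "$table"
```

Piping the result through `column -t -s "$(printf '\t')"` should align the columns on systems where `column` is available.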

Good luck!

score 0 · Accepted Answer

I agree that the question is too loosely specified for a simple solution with simple tools, but try something like this in bash:

for x in $(awk '/Stage /{print $3}' file.txt);
do
  g=`grep "test.$x.gd" file.txt`;
  test -z "$g" && g=`awk "/Stage ${x}/,/##/" file.txt | grep -v '#'`
  echo -e "Stage $x\t$g";
done

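It takes the stage names (without spaces) from each block, then tries to map each one to a command line through that line's -graph_data argument. If no match is found, it takes the lines between the "Stage name" declaration and the start of the next block (assuming each block starts with a ## sequence). It should work.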
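
The fallback step described above (collecting the lines between a stage declaration and the next `##` block) can also be illustrated in isolation. The sketch below uses a flag variable instead of the awk range pattern from the loop, and compares the stage name exactly rather than as a regular expression. The `sample` text and the function name `stage_commands` are assumptions made for this example only.

```shell
#!/bin/sh
# Hypothetical sample input: two blocks from the log in the question.
sample='####################
# Stage Results
mkdir ./QC_graphics_results/

####################
# Stage Statistics_graph_2
fastqc test.fastq -o ./QC_graphics_results/'

# stage_commands (assumed name): print the command lines that belong to the
# stage given as $1, i.e. the non-blank, non-comment lines between its
# "# Stage <name>" declaration and the next line starting with "##".
stage_commands() {
  printf '%s\n' "$sample" | awk -v name="$1" '
    $1 == "#" && $2 == "Stage" && $3 == name { in_stage = 1; next }  # block of interest starts
    /^##/ { in_stage = 0 }                                           # separator: block ends
    in_stage && NF > 0 && !/^#/ { print }                            # command line inside the block
  '
}

results=$(stage_commands Results)
printf 'Stage Results\t%s\n' "$results"
```

Passing the name with `-v` keeps it out of the awk program text, so a stage name containing regex metacharacters would not change the pattern.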
