file - awk: conditionally filter one file based on another file (or other solutions)

Question

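Programming beginner here, needing some help modifying an AWK script to make it conditional. Alternative non-awk solutions are also very welcome.

Note: thanks to Birei's help, the main filtering now works, but I have a further question; see the related note below for details.

I have a series of input files with 3 columns, like this: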

chr4    190499999   190999999
chr6    61999999    62499999
chr1    145499999   145999999

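I want to use these lines to filter another file (refGene.txt): if a line in the first file matches a line in refGene.txt, output column 13 of refGene.txt to a new file, "ListofGenes_$f". The tricky part for me is that I want it to count as a match as long as the first column (e.g. 'chr4', 'chr6', 'chr1') and column 2 AND/OR column 3 match the equivalent columns in refGene.txt. The equivalent columns between the two files are $1=$3, $2=$5, $3=$6. I am also not sure how, in awk, to print only column 13 rather than the whole line from refGene.txt.

Note: thanks to Birei's help, the conditional filtering above now works. I now need to add one more filter condition: if any part of the region between $2 and $3 overlaps the region between $5 and $6 in refGene.txt, I also need to output column $13 from refGene.txt. This seems much more complicated, since it involves arithmetic to check whether the regions overlap.

My script so far: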

FILES=/files/*txt   
for f in $FILES ;
do

    awk '
        BEGIN {
            FS = "\t";
        }
        FILENAME == ARGV[1] {
            pair[ $1, $2, $3 ] = 1;
            next;
        }
        {
            if ( pair[ $3, $5, $6 ] == 1 ) {
                print $13;
            }
        }
    ' "$f" /files/refGene.txt > "/files/results/$(basename "$f")" ;
done

Any help is much appreciated. Thanks a lot!

Rubal

score 1 · Accepted Answer

One way:

awk '
    BEGIN { FS = "\t"; }

    ## Save the third, fifth and sixth fields of the first file in the arguments
    ## (refGene.txt) as the key to compare later. The value is the field to print.
    FNR == NR {
        pair[ $3, $5, $6 ] = $13;
        next;
    }

    ## Set the name of the output file: same directory as the current input
    ## file, with "ListOfGenes_" prepended to the file name.
    FNR == 1 {
        if ( output_file ) close( output_file );
        n = split( FILENAME, path, "/" );
        output_file = "";
        for ( i = 1; i < n; i++ ) {
            output_file = output_file ( i > 1 ? "/" : "" ) path[i];
        }
        output_file = ( n > 1 ? output_file "/" : "" ) "ListOfGenes_" path[n];
    }

    ## If $1 = $3, $2 = $5 and $3 = $6, print $13 to the output file.
    {
        if ( ( $1, $2, $3 ) in pair ) {
            print pair[ $1, $2, $3 ] > output_file;
        }
    }
' refGene.txt /files/rubal/*.txt
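
The script above only handles exact matches on all three columns. For the follow-up about overlapping regions, something along these lines might work. It is only a minimal sketch: the function name `overlapping_genes` is made up for illustration, both files are assumed to be tab-separated, and regions are treated as closed intervals (so two regions that merely share an endpoint count as overlapping). Two intervals overlap when each one starts no later than the other one ends.

```shell
# Hypothetical helper (not from the original post): print column 13 of every
# refGene row that overlaps at least one region in the regions file.
# Usage: overlapping_genes regions.txt refGene.txt
overlapping_genes() {
    awk '
        BEGIN { FS = "\t" }

        # First file: remember every region (chromosome, start, end).
        FILENAME == ARGV[1] {
            n++
            chrom[n] = $1
            lo[n] = $2 + 0      # "+ 0" forces a numeric comparison later
            hi[n] = $3 + 0
            next
        }

        # Second file: $3 = chromosome, $5 = start, $6 = end (as in the question).
        {
            for (i = 1; i <= n; i++) {
                # Same chromosome, and each interval starts before the other ends.
                if ($3 == chrom[i] && $5 + 0 <= hi[i] && lo[i] <= $6 + 0) {
                    print $13
                    break       # print each refGene row at most once
                }
            }
        }
    ' "$1" "$2"
}
```

Provided start is never greater than end in either file, this test also covers the "column 2 AND/OR column 3 match" case, because a shared start or end coordinate is itself an overlap. Calling it from the loop in the question would look like `overlapping_genes "$f" /files/refGene.txt > "/files/results/ListofGenes_$(basename "$f")"`. It compares every refGene row against every region, which should be fine for a handful of regions per input file but may be slow for very large region lists.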
