linux - Find duplicates in desired column and print the selected patterns in awk?

Question

Input file: Input.txt

A B C
1 rs1 5
1 kp1 5
1 rs2 6
1 ga2 6
1 rs8 9
2 kp3 7
2 rs3 7
2 rs4 5
2 rs5 8
3 kp6 4
3 kp7 6

For each category in column A (for example 1, 2 and 3), look separately for duplicates in column C. If there are duplicate numbers, then print the rows with non-rs IDs for that category into a separate file per category.

Output files:

file_category_1.txt

A B C
1 kp1 5
1 ga2 6

file_category_2.txt

A B C
2 kp3 7

file_category_3.txt

A B C

Here file_category_3.txt contains only the header, because category 3 has no duplicates.

score 2 · Accepted Answer

This will get you most of the way there:

awk 'NR==1 {print; next} seen[$1,$3]++ {print}'

answered 2013-01-02T17:42:59.590
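
The one-liner above prints the second and later occurrence of each (A, C) pair whatever its ID, so on the sample data it emits `2 rs3 7` rather than `2 kp3 7`. A minimal sketch of one way to close that gap, assuming the question's Input.txt can be read twice: count each (A, C) pair on the first pass, then print only the non-rs rows of duplicated pairs on the second. The temporary directory and the `result` variable are illustrative, and the per-category files are left out here.

```shell
# Work in a scratch directory and recreate the sample input from the question.
workdir=$(mktemp -d)
cd "$workdir"
cat > Input.txt <<'EOF'
A B C
1 rs1 5
1 kp1 5
1 rs2 6
1 ga2 6
1 rs8 9
2 kp3 7
2 rs3 7
2 rs4 5
2 rs5 8
3 kp6 4
3 kp7 6
EOF

# Pass 1 (NR == FNR): count each (A, C) pair.
# Pass 2: print the header, then non-rs rows whose pair occurred more than once.
result=$(awk '
NR == FNR { cnt[$1, $3]++; next }
FNR == 1  { print; next }
cnt[$1, $3] > 1 && $2 !~ /^rs/
' Input.txt Input.txt)

printf '%s\n' "$result"
```

On the sample data this prints the header followed by `1 kp1 5`, `1 ga2 6` and `2 kp3 7`.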

score 1 · Accepted Answer

Untested, but should be close:

awk '
NR==1 {
   hdr = $0
   next
}
{
   cnt[$1,$3]++
   cats[$1]
   ids[$2]
   map[$1,$3,$2] = $0
}
END {
   for (cat in cats) {
      print hdr > "file_category_" cat ".txt"
   }
   for (key in cnt) {
      if (cnt[key] > 1) {
         split(key,keyA,SUBSEP)
         for (id in ids) {
            if ((key,id) in map) {
               print map[key,id] > "file_category_" keyA[1] ".txt"
            }
         }
      }
   }
}' file
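
As written, the script above prints every row of a duplicated (A, C) group, rs IDs included, and the `for (key in cnt)` loop visits groups in an unspecified order. A hedged variant, not the answerer's code: it keeps the rows in input order and adds the non-rs test the question asks for. The names `row`, `started` and the scratch directory are assumptions made for this sketch.

```shell
# Work in a scratch directory and recreate the sample input from the question.
workdir=$(mktemp -d)
cd "$workdir"
cat > Input.txt <<'EOF'
A B C
1 rs1 5
1 kp1 5
1 rs2 6
1 ga2 6
1 rs8 9
2 kp3 7
2 rs3 7
2 rs4 5
2 rs5 8
3 kp6 4
3 kp7 6
EOF

awk '
NR == 1 { hdr = $0; next }
{
    cnt[$1, $3]++                 # occurrences of each (A, C) pair
    row[++n] = $0                 # remember rows in input order
    if (!($1 in started)) {       # first row of a category: start its file
        started[$1] = 1
        print hdr > ("file_category_" $1 ".txt")
    }
}
END {
    for (i = 1; i <= n; i++) {
        split(row[i], f)          # f[1]=A, f[2]=B, f[3]=C
        if (cnt[f[1], f[3]] > 1 && f[2] !~ /^rs/)
            print row[i] > ("file_category_" f[1] ".txt")
    }
}' Input.txt
```

Each output file stays open for the whole run, which is fine for three categories. With many categories some awk implementations may hit an open-file limit; closing each file after writing and appending with `>>` is one common workaround.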

score 0 · Accepted Answer

Not too difficult. Here is the one-liner you can do, in pseudo-code:

But, as Clement pointed out, you should do part of the work yourself ^^ That's why it's only pseudo-code (though it can be turned into actual code straightforwardly, in maybe 3 minutes).

score 0 · Accepted Answer

You actually have two problems here, and both can be solved using awk.

First, you need to split the file into smaller files:

awk 'NR==1 { r=$0; next } { print ($1==i ? "" : r ORS) $0 > "file_category_" $1 ".txt"; i=$1 }' input.txt

Second, you need to filter the smaller files according to your selection criteria:

for i in file_category_*.txt; do awk 'FNR==NR { a[$3]++; next } FNR==1 || a[$3]>1 && $2 !~ /^rs/' "$i"{,} > tmp && mv tmp "$i"; done

Here are the results of grep . file_category_*.txt:

file_category_1.txt:A B C
file_category_1.txt:1 kp1 5
file_category_1.txt:1 ga2 6
file_category_2.txt:A B C
file_category_2.txt:2 kp3 7
file_category_3.txt:A B C
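
The `"$i"{,}` in the filter loop above is bash brace expansion: it expands to `"$i" "$i"`, so awk reads each file twice and `FNR==NR` is true only during the first read. A sketch of the same two steps for shells without brace expansion, passing the file name twice explicitly. The scratch directory, the loop variable `f` and the `.tmp` suffix are choices made for this example, and the split step assumes the input is grouped by column A, as in the question.

```shell
# Work in a scratch directory and recreate the sample input from the question.
workdir=$(mktemp -d)
cd "$workdir"
cat > Input.txt <<'EOF'
A B C
1 rs1 5
1 kp1 5
1 rs2 6
1 ga2 6
1 rs8 9
2 kp3 7
2 rs3 7
2 rs4 5
2 rs5 8
3 kp6 4
3 kp7 6
EOF

# Step 1: split by category, writing the header whenever the category changes.
awk 'NR==1 { r=$0; next }
     { print ($1==i ? "" : r ORS) $0 > ("file_category_" $1 ".txt"); i=$1 }' Input.txt

# Step 2: read each file twice. The first read counts column C values,
# the second keeps the header and the non-rs rows whose value is duplicated.
for f in file_category_*.txt; do
    awk 'FNR==NR { a[$3]++; next }
         FNR==1 || (a[$3]>1 && $2 !~ /^rs/)' "$f" "$f" > "$f.tmp" &&
        mv "$f.tmp" "$f"
done
```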

Alternatively, if you have GNU awk and want a single-pass solution, you can use multidimensional arrays to do the same thing. Run it like this:

awk -f script.awk input.txt

Contents of script.awk:

NR==1 {
    r=$0
    next
}

{
    a[$1][$3][$0]
    next
}

END {
    for (i in a) {
        for (j in a[i]) {
            for (k in a[i][j]) {

                split(k,b)
                print (n==1 ? "" : r ORS) \
                    (length(a[i][j])>1 ?  \
                    (b[2] !~ /^rs/ ? k : "") : "") \
                    > "file_category_" i ".txt"
                n=1
            }
        }
        n=0
    }
}

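Here are the results of grep . file_category_*.txt (grep . also drops the empty lines that the script prints for rows it filters out):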

file_category_1.txt:A B C
file_category_1.txt:1 kp1 5
file_category_1.txt:1 ga2 6
file_category_2.txt:A B C
file_category_2.txt:2 kp3 7
file_category_3.txt:A B C

Or, here is the one-liner:

awk 'NR==1 { r=$0; next } { a[$1][$3][$0]; next } END { for (i in a) { for (j in a[i]) for (k in a[i][j]) { split(k,b); print (n==1 ? "" : r ORS) (length(a[i][j])>1 ? (b[2] !~ /^rs/ ? k : "") : "") > "file_category_" i ".txt"; n=1 } n=0 } }' input.txt
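
The GNU awk script above calls print for every stored row, so rows that fail the test still produce an empty line in the output files. A hedged sketch of a variant that skips those rows instead, assuming GNU awk 4.0 or later for arrays of arrays. The names `rows`, `cat`, `c`, `line` and `out` are illustrative. As in the original, identical input lines collapse into one entry, and the order of rows within a file is not guaranteed.

```shell
# Work in a scratch directory and recreate the sample input from the question.
workdir=$(mktemp -d)
cd "$workdir"
cat > Input.txt <<'EOF'
A B C
1 rs1 5
1 kp1 5
1 rs2 6
1 ga2 6
1 rs8 9
2 kp3 7
2 rs3 7
2 rs4 5
2 rs5 8
3 kp6 4
3 kp7 6
EOF

if command -v gawk >/dev/null 2>&1; then
    gawk '
    NR == 1 { hdr = $0; next }
    { rows[$1][$3][$0] }                    # category -> C value -> set of rows
    END {
        for (cat in rows) {
            out = "file_category_" cat ".txt"
            print hdr > out                 # every category gets the header
            for (c in rows[cat]) {
                if (length(rows[cat][c]) < 2)
                    continue                # this C value is not duplicated
                for (line in rows[cat][c]) {
                    split(line, f)          # f[2] is the ID in column B
                    if (f[2] !~ /^rs/)
                        print line > out
                }
            }
            close(out)
        }
    }' Input.txt
else
    echo "gawk not found; this sketch needs GNU awk" >&2
fi
```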
