I have a text file that contains experimental data in 4-line groups, and I would like to remove the datapoints that are rare. Below is the file format for the first 12 lines, and this pattern repeats for thousands of lines (the line numbers don't exist in the file):

1 Info_1
2 Sequence_1
3 Info_1
4 Info_1
5 Info_2
6 Sequence_2
7 Info_2
8 Info_2
9 Info_3
10 Sequence_3
11 Info_3
12 Info_3

So, lines 1-4 contain the information for sequence 1, lines 5-8 contain the information for sequence 2, lines 9-12 for sequence 3, and so on. In certain situations it is common to remove any group of four lines whose sequence is entirely unique, or found fewer than 3 times.

What I would like to do is compare line 2 with lines 6, 10, 14, 18... If the sequence is found at least 3 times, do nothing. If it is found fewer than 3 times, remove lines 1-4 and each group of 4 lines that contains the matching sequence. Then run the same comparison for every other sequence line in the file.

So if, in the above file, Sequence_1 and Sequence_3 match, then because that sequence is repeated fewer than 3 times, delete each of those groups of four lines, and the resulting file should look like this:

1 Info_2
2 Sequence_2
3 Info_2
4 Info_2

Here is what I have started with:

awk 'FNR==NR {
    if (FNR%4==2) {
        a[$1]++
        if (a[$1]<3) b[int(FNR/4)]=1
    }
    next}
b[int(FNR/4)]==0' inputFile inputFile > outputFile

This does not delete all of the groups whose sequences are found fewer than three times, though. I would appreciate any help. Thanks.

Here is a real, testable example, as requested. Input:

Info1
AAGC
Info1
Info1
Info2
AACT
Info2
Info2
Info3
AAGC
Info3
Info3
Info4
AAGC
Info4
Info4
Info5
AACT
Info5
Info5

Because AAGC occurs >= 3 times but AACT occurs < 3 times, the output should be:

Info1
AAGC
Info1
Info1
Info3
AAGC
Info3
Info3
Info4
AAGC
Info4
Info4

Hopefully that helps to clarify.
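To pin down the behaviour I'm after, here is the intended logic sketched in plain Python (just a sketch of the goal, not a fix for the awk attempt; `filter_groups` and `min_count` are names I made up): one pass counts line 2 of every group, a second pass keeps only the groups whose sequence appears at least 3 times.

```python
def filter_groups(lines, min_count=3):
    """Keep each 4-line group whose 2nd line occurs >= min_count times."""
    # Pass 1: count the sequence line (index 1 of every group of 4).
    counts = {}
    for i in range(1, len(lines), 4):
        counts[lines[i]] = counts.get(lines[i], 0) + 1
    # Pass 2: emit whole groups whose sequence is frequent enough.
    kept = []
    for g in range(0, len(lines), 4):
        if counts.get(lines[g + 1], 0) >= min_count:
            kept.extend(lines[g:g + 4])
    return kept

sample = [
    "Info1", "AAGC", "Info1", "Info1",
    "Info2", "AACT", "Info2", "Info2",
    "Info3", "AAGC", "Info3", "Info3",
    "Info4", "AAGC", "Info4", "Info4",
    "Info5", "AACT", "Info5", "Info5",
]
print("\n".join(filter_groups(sample)))
```

Running this on the sample above prints the 12 kept lines (the three AAGC groups).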

3 Answers

For this kind of problem it is best to use a high-level programming language and a specialised bioinformatics library; I used Python and Biopython:

from Bio import SeqIO

input_file = open("input.fastq")

#to store the records in a dictionary keyed by DNA sequence
#this holds everything in RAM (if the fastq is very, very large, it could be a problem)
dict_seq = {}
for seq in SeqIO.parse(input_file, "fastq"):
  if not str(seq.seq) in dict_seq:
    dict_seq[str(seq.seq)] = []
  dict_seq[str(seq.seq)].append(seq)
input_file.close()

#filter sequences
list_filter = [ dna for dna in dict_seq.keys() if len(dict_seq[dna]) >= 3]

#print output in fastq format
output_file = open("filter.fastq", "w")
for dna in list_filter:
  for seq in dict_seq[dna]:
    output_file.write(seq.format("fastq"))

output_file.close()

input.fastq

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
@EAS54_6_R1_2_1_413_789
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_413_329
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88

filter.fastq

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_413_789
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_413_329
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
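Since the comment in the code above notes the RAM cost, an alternative is a two-pass sketch in plain Python (no Biopython; `read_records` and `filter_fastq` are hypothetical names, and it assumes well-formed 4-line FASTQ records): the first pass only counts sequences, and the second streams matching records straight to the output file, so nothing is held in memory beyond the counts.

```python
from collections import Counter
from itertools import islice

def read_records(path):
    """Yield 4-line FASTQ records (header, sequence, plus, quality)."""
    with open(path) as fh:
        while True:
            rec = list(islice(fh, 4))
            if not rec:
                break
            yield [line.rstrip("\n") for line in rec]

def filter_fastq(src, dst, min_count=3):
    # Pass 1: count each DNA sequence (line 2 of every record).
    counts = Counter(rec[1] for rec in read_records(src))
    # Pass 2: write only records whose sequence occurs often enough.
    with open(dst, "w") as out:
        for rec in read_records(src):
            if counts[rec[1]] >= min_count:
                out.write("\n".join(rec) + "\n")
```

Usage would be `filter_fastq("input.fastq", "filter.fastq")`, producing the same three CCCTT... records as above.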
answered 2015-05-29T02:57:47.623
Extending your script:

$ awk '
    (NR==FNR){ # First pass: record the count of each sequence.
        if(FNR%4==2){a[$1]++}
        next;
    }

    (FNR%4){ # Second pass: remember the first 3 lines of each group of 4.
        r[FNR%4]=$0;
        next;
    }

    { # On every 4th line, if the count is >= 3, print the entire 4-line group.
        if (a[r[2]] >= 3) print r[1] "\n" r[2] "\n" r[3] "\n" $0
    }' inputFile inputFile
answered 2015-05-29T04:32:26.767
$ cat tst.awk
{
    numRecs += (NR%4 == 1 ? 1 : 0)
    rec[numRecs,(NR-1)%4 + 1] = $0
    cnt[$0]++
}
END {
    for (recNr=1; recNr<=numRecs; recNr++) {
        if (cnt[rec[recNr,2]] > 2) {
            for (fldNr=1; fldNr<=4; fldNr++) {
                print rec[recNr,fldNr]
            }
        }
    }
}

$ awk -f tst.awk file
Info1
AAGC
Info1
Info1
Info3
AAGC
Info3
Info3
Info4
AAGC
Info4
Info4
answered 2015-05-29T04:27:53.093