I have a text file that contains experimental data in 4-line groups, and I would like to remove the data points that are rare. Below is the format of the first 12 lines of the file; this pattern repeats for thousands of lines (the line numbers don't exist in the actual file):
1 Info_1
2 Sequence_1
3 Info_1
4 Info_1
5 Info_2
6 Sequence_2
7 Info_2
8 Info_2
9 Info_3
10 Sequence_3
11 Info_3
12 Info_3
So, lines 1-4 contain the information for sequence 1, lines 5-8 contain the information for sequence 2, lines 9-12 for sequence 3, and so on. It is common in certain situations to remove any group of four lines whose sequence is entirely unique or found fewer than 3 times.
What I would like to do is compare line 2 with lines 6, 10, 14, 18, and so on: if that sequence is found 3 or more times, do nothing; if it is found fewer than 3 times, remove lines 1-4 and every other group of 4 lines that contains the matching sequence. Then run the same comparison for every other sequence line in the file.
So if, in the above file, Sequence 1 and Sequence 3 match, then because that sequence appears fewer than 3 times, each of those groups of four lines is deleted (assuming Sequence 2 appears often enough elsewhere in the full file to be kept), and the resulting file should look like this:
1 Info_2
2 Sequence_2
3 Info_2
4 Info_2
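
As a side note, the counts I am comparing against 3 can be listed with something like this (just a quick sketch; inputFile stands for the real file, and it assumes the 4-line pattern is never broken):

awk 'NR%4==2' inputFile | sort | uniq -c | sort -rn

That prints each distinct sequence together with the number of groups it appears in.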
Here is what I have started with:
awk 'FNR==NR {                          # first pass over inputFile
    if (FNR%4==2) {                     # only look at the sequence lines
        a[$1]++                         # count this sequence
        if (a[$1]<3) b[int(FNR/4)]=1    # flag the group if the count so far is < 3
    }
    next
}
b[int(FNR/4)]==0' inputFile inputFile > outputFile   # second pass: print unflagged groups
This does not delete all of the lines belonging to groups whose sequence is found fewer than three times, though. I would appreciate any help. Thanks.
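
My suspicion (possibly wrong) is that there are two problems: b[] is filled during the first pass while the counts are still accumulating, so the early occurrences of a common sequence get flagged anyway, and int(FNR/4) puts the 4th line of each group into the next group's index. Here is a rough sketch of the two-pass logic I think I am after, using the same inputFile/outputFile names and treating the whole sequence line as the key (untested):

awk 'FNR==NR {
    if (FNR%4==2) {
        count[$0]++                  # first pass: total occurrences of each sequence
        seq[int((FNR-1)/4)] = $0     # remember which sequence belongs to this group
    }
    next
}
count[seq[int((FNR-1)/4)]] >= 3' inputFile inputFile > outputFile

The idea is that the keep/delete decision is only made in the second pass, once every sequence has been counted.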
Here is a real, testable example, as requested. Input:
Info1
AAGC
Info1
Info1
Info2
AACT
Info2
Info2
Info3
AAGC
Info3
Info3
Info4
AAGC
Info4
Info4
Info5
AACT
Info5
Info5
Because AAGC occurs >= 3 times but AACT occurs < 3 times, the output should be:
Info1
AAGC
Info1
Info1
Info3
AAGC
Info3
Info3
Info4
AAGC
Info4
Info4
Hopefully that helps to clarify.