I'm trying to write a function that finds the sequence 'ATG' in a string and then moves along the string in steps of 3 until it reaches a 'TAA', 'TAG', or 'TGA' (ATG-xxx-xxx-TAA|TAG|TGA).
To do this, I wrote this line of code (where fdna is the input sequence):
import re
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)', fdna)
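For illustration, here is a minimal runnable sketch of how that pattern walks the string one codon at a time. The toy input is made up for this example and is not real data; it contains an out-of-frame 'TAG' that the pattern should skip.

```python
import re

# Hypothetical toy input: the only in-frame stop after ATG is the final TGA.
# The 'TAG' inside 'ATAGCC' is out of frame, so it should be skipped.
fdna = 'CC' + 'ATG' + 'ATA' + 'GCC' + 'TGA' + 'GG'

# (?:...)*? consumes 3 characters at a time, as few times as possible,
# so the match ends at the first in-frame stop codon.
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)', fdna)
print(ORF_sequences)  # ['ATGATAGCCTGA']
```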
I then wanted to add 3 requirements:
- Total length (from the ATG through the stop codon) must be at least 30
- There must be an A or a G three positions before the ATG, i.e. with two characters in between (A|G-x-x-A-T-G-x-x-x)
- The character right after the ATG must be a G (A-T-G-G-x-x)
To execute this part, I changed my code to:
ORF_sequence_finder = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
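As a quick check of what this pattern does, the sketch below runs it on three made-up inputs (built here for illustration, not the sequences listed in this post). It only matches when requirement 2 and requirement 3 hold at the same time, and because `[AG]..` is consumed, the match also includes the three characters before the ATG.

```python
import re

pattern = r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)'

# Hypothetical inputs, constructed so the reading frame is easy to verify.
only_req2 = 'AGCC' + 'ATG' + 'TGG' + 'GGG' * 9 + 'TGA' + 'AAA'  # G before ATG, T after
only_req3 = 'ATCC' + 'ATG' + 'GGG' * 10 + 'TAG'                 # T before ATG, G after
both = 'AGCC' + 'ATG' + 'GGG' * 10 + 'TAG'                      # G before ATG, G after

matches = {
    'only_req2': re.findall(pattern, only_req2),
    'only_req3': re.findall(pattern, only_req3),
    'both': re.findall(pattern, both),
}
for name, found in matches.items():
    print(name, found)
```

Only `both` produces a match here, and that match starts at the G three positions before the ATG rather than at the ATG itself.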
Instead of enforcing all three limits at once, what I actually want is requirement 1 (at least 30 characters) together with EITHER requirement 2 (A|G-x-x-A-T-G-x-x-x) OR requirement 3 (A-T-G-G-x-x), or both.
If I split the above line into two patterns and append both sets of results to a list, the results end up out of order and contain repeats.
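To make the problem concrete, here is a sketch of the split approach. The two patterns are an assumed way of splitting the combined one (one pattern per requirement), and the input is hypothetical: an ORF that meets only requirement 3, followed by one that meets both.

```python
import re

# Assumed split of the combined pattern into one pattern per requirement.
req2 = r'[AG]..ATG(?:...){8,}?(?:TAA|TAG|TGA)'
req3 = r'ATGG..(?:...){7,}?(?:TAA|TAG|TGA)'

# Hypothetical input: the first ORF has a T three positions before its ATG
# (requirement 3 only), the second has a G there (both requirements).
orf = 'ATG' + 'GGG' * 10 + 'TAG'
fdna = 'ATCC' + orf + 'CCC' + 'AGCC' + orf

results = []
results.extend(re.findall(req2, fdna))  # only the second ORF, with a 3-char prefix
results.extend(re.findall(req3, fdna))  # both ORFs, so the second one shows up again

for r in results:
    print(r)
```

The second ORF is listed first (with its `GCC` prefix) and then once more after the first ORF, which is the out-of-order and duplicate behaviour described above.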
Here are a few examples of the different cases:
sequence1 = 'AGCCATGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGAAAA'
sequence2 = 'ATCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'
sequence3 = 'AGCCATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTAG'
sequence4 = 'ATGGGGTGA'
sequence1 = 'A**G**CC*ATG*TGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'
sequence1 would be accepted because it follows requirement 2 (A|G-x-x-A-T-G-x-x-x) and its length is >= 30.
sequence2 = 'ATCC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TAG*
sequence2 would be accepted because it follows requirement 3 (A-T-G-G-x-x) and its length is >= 30.
sequence3 = 'A**G**CC*ATG***G**GGGGGGGGGGGGGGGGGGGGGGGGGGGGG*TGA*AAA'
sequence3 would be accepted because it fulfills both requirement 2 and requirement 3 and has >= 30 characters.
sequence4 = 'ATGGGGTGA'
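sequence4 would NOT be accepted because it is shorter than 30 characters. It also fails requirement 2 (nothing precedes the ATG); it does start with ATGG, so it meets requirement 3, but that is not enough without requirement 1.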
So basically, I want to accept sequences that satisfy requirement 1 and at least one of requirement 2 and requirement 3.
How can I do this without adding duplicates (in cases where both requirements are met) and without the results getting out of order?
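One direction that might work, offered as a sketch under assumptions rather than a confirmed answer: keep a single pattern and express the two optional requirements as alternatives, using a lookbehind for requirement 2 and a lookahead for requirement 3. Since `re.findall` and `re.finditer` scan left to right and return non-overlapping matches, each ORF should appear once and in order. Note one difference from the pattern above: with a lookbehind, the match starts at the ATG and no longer includes the three preceding characters. The test inputs below are made up for illustration.

```python
import re

# Requirement 2 as a lookbehind OR requirement 3 as a lookahead, followed by
# at least 8 more codons and a stop codon (3 + 24 + 3 = 30 characters).
pattern = re.compile(r'(?:(?<=[AG]..)ATG|ATG(?=G))(?:...){8,}?(?:TAA|TAG|TGA)')

# Hypothetical inputs, constructed so the reading frame is easy to verify.
orf_t = 'ATG' + 'TGG' + 'GGG' * 9 + 'TGA'  # T right after ATG
orf_g = 'ATG' + 'GGG' * 10 + 'TAG'         # G right after ATG
examples = {
    'req2_only': 'AGCC' + orf_t + 'AAA',
    'req3_only': 'ATCC' + orf_g,
    'both': 'AGCC' + orf_g,
    'too_short': 'ATGGGGTGA',
}
found = {name: pattern.findall(seq) for name, seq in examples.items()}
for name, hits in found.items():
    print(name, hits)

# Two ORFs in one string: each is reported once, in left-to-right order.
fdna = 'ATCC' + orf_g + 'CCC' + 'AGCC' + orf_g
positions = [m.start() for m in pattern.finditer(fdna)]
print(positions)  # [4, 47]
```

Like the original pattern, this sketch does not stop at an in-frame stop codon that falls within the first 8 codons after the ATG; whether that matters depends on the intended definition of an ORF.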