I am trying to teach myself bioinformatics, arriving to the party by way of computer science and high performance computing. (Essentially, I'm trying to learn the biology.) I've recently discovered BioPython and so far think it's great, but I am curious if anybody could help me identify why the translate() method used in BioPython to convert sequence data to ORF candidates and amino protein chains is behaving differently than expected.
The following is from this year's DNA60 challenge, and it's to find all of the ORF's in a sequence and sort them, convert them to amino chains, and then take the 25th amino acid from the longest top 15 chains to spell out a phrase.
Here's the challenge: http://genomebiology.com/about/update/DNA60_TAGCGAC
So after doing some research, I settled on using the code directly out of the tutorial for finding and identifying ORF's using the translate method, found here:
http://www.bio-cloud.info/Biopython/en/ch16.html
Modifying it to print out the 25th amino acide for each chain, and sorting the output by chain length (using the linux command line tool "sort"), the output is entirely wrong.
Knowing what the answer was supposed to be, I could not figure out why this wasn't working. So I wrote my own script to do the ORF identification and and translation, sorted the output, and it worked! (Using NCBI table 1, min length of 25.)
So somehow, the ORF identification in the translate method is not working the way I think it should, and I was hoping somebody could tell me why. Below is my code for ORF identification in Python (and you pass in the reverse_complement for the second set of three frames)
def getORF(sequence, treshold, start_codons, stop_codons):
start_codon_index = 0
end_codon_index = 0
start_codon_found = False
orfs = []
for j in range(0, 3):
for indx in range(j, len(sequence), 3):
current_codon = sequence[indx:indx+3]
if current_codon in start_codons and not start_codon_found:
start_codon_found = True
start_codon_index = indx
if current_codon in stop_codons and start_codon_found:
end_codon_index = indx
length = end_codon_index - start_codon_index + 3
if length >= treshold * 3:
orfs.append(start_codon_index)
if length % 3 != 0:
#print "it's going to complain"
#print len(sequence)-end_codon_index-3
snippet = Seq(sequence[start_codon_index:end_codon_index])
proteins = Seq.translate(snippet)
if len(proteins) >= 25:
print "%i [%s]" % (length/3, proteins[24])
start_codon_found = False
start_codon_index = 0
end_codon_index = 0
start_codon_found = False
return len(orfs), orfs
Pretty straightforward. Here's the rest of it:
f = open('genome.fa')
seqrecords = list(SeqIO.parse(f, 'fasta'))
rec = seqrecords[0]
seq = str(rec.seq)
comp_seq = str(rec.seq.reverse_complement())
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
length_orf, orfs = getORF(seq, 25, start, stop)
length_orf_complement, orfs_complement = getComplementORF(comp_seq, 25, start, stop)
Do this once for each strand (initial and reverse_complement) then if you sort the output using the following command to give you the 15 longest:
python orf.py | sort -k1,1n | tail -15
the output is:
155 [E]
157 [F]
158 [I]
163 [L]
166 [F]
167 [Q]
170 [T]
171 [E]
173 [R]
175 [C]
176 [E]
187 [S]
201 [E]
211 [H]
234 [T]
This is the correct phrase. The output using the straight up translate is:
TPKSSSILLRPCQCVSDRKHVTRTAYNFFI...KLA - length 178, strand 1, frame 2 [T]
EAQVRFPVFSSDCPLMMLFSRRLLIGLVRR...GRD - length 181, strand -1, frame 0 [L]
KSGGSTREFRGMSVPEAVRFLKILGNICEQ...RNS - length 181, strand 1, frame 2 [L]
YSLGQQGPEGGVSFEVIAVVVHPKTERGSR...TLA - length 181, strand 1, frame 2 [K]
TVDFQRHSLIVVVARNHLLSTRVQAGLSRD...SWG - length 183, strand -1, frame 0 [Q]
RGRLPDYKTTRACAENTIELRFPPSVYISE...TSN - length 185, strand 1, frame 0 [P]
YYSRLKETPPTQPNPAIMGRRSDETALTRQ...RSF - length 191, strand -1, frame 0 [E]
GPYLWFSCLARGTCKTGDIDYRNSSVVDPY...RPT - length 199, strand 1, frame 0 [S]
LHNQQAQECDDFCMRCRHEVSYSLLNKDGF...LIM - length 199, strand 1, frame 0 [L]
PWLHWESSLGNIFTLRPWVHGFYKEPGCNK...CLF - length 199, strand -1, frame 1 [K]
TQPVQFGLYLTHMAGVGTTREGLTQGLMLY...WHI - length 212, strand -1, frame 0 [T]
VSMVANTFIPLSMGCRYITHSICVSRHMRY...LPV - length 212, strand 1, frame 1 [V]
AVWTSGIELAVQQGTRDVILKNGRQIREVS...QSL - length 219, strand 1, frame 0 [R]
ELTRLDITVLSLCNVSPRNVYGINNASASQ...TIR - length 223, strand -1, frame 0 [N]
TRSKSNGLSMEDNRPLFALRRYWDTTSGSS...KGW - length 242, strand -1, frame 2 [D]
You can see that the lengths don't even match up. What gives here? Am I missing something?