0

I am trying to teach myself bioinformatics, arriving to the party by way of computer science and high performance computing. (Essentially, I'm trying to learn the biology.) I've recently discovered BioPython and so far think it's great, but I am curious if anybody could help me identify why the translate() method used in BioPython to convert sequence data to ORF candidates and amino protein chains is behaving differently than expected.

The following is from this year's DNA60 challenge, and it's to find all of the ORF's in a sequence and sort them, convert them to amino chains, and then take the 25th amino acid from the longest top 15 chains to spell out a phrase.

Here's the challenge: http://genomebiology.com/about/update/DNA60_TAGCGAC

So after doing some research, I settled on using the code directly out of the tutorial for finding and identifying ORF's using the translate method, found here:

http://www.bio-cloud.info/Biopython/en/ch16.html

Modifying it to print out the 25th amino acide for each chain, and sorting the output by chain length (using the linux command line tool "sort"), the output is entirely wrong.

Knowing what the answer was supposed to be, I could not figure out why this wasn't working. So I wrote my own script to do the ORF identification and and translation, sorted the output, and it worked! (Using NCBI table 1, min length of 25.)

So somehow, the ORF identification in the translate method is not working the way I think it should, and I was hoping somebody could tell me why. Below is my code for ORF identification in Python (and you pass in the reverse_complement for the second set of three frames)

def getORF(sequence, treshold, start_codons, stop_codons):
    start_codon_index = 0
    end_codon_index = 0
    start_codon_found = False

    orfs = []

        for j in range(0, 3):
        for indx in range(j, len(sequence), 3):
            current_codon = sequence[indx:indx+3]
            if current_codon in start_codons and not start_codon_found:
                start_codon_found = True
                start_codon_index = indx
            if current_codon in stop_codons and start_codon_found:
                end_codon_index = indx
                length = end_codon_index - start_codon_index + 3
                if length >= treshold * 3:
                    orfs.append(start_codon_index)
                    if length % 3 != 0:
                        #print "it's going to complain"
                    #print len(sequence)-end_codon_index-3                                                                                         
                    snippet = Seq(sequence[start_codon_index:end_codon_index])
                    proteins = Seq.translate(snippet)
                    if len(proteins) >= 25:
                        print "%i [%s]" % (length/3, proteins[24])
                start_codon_found = False

        start_codon_index = 0
        end_codon_index = 0
        start_codon_found = False

    return len(orfs), orfs

Pretty straightforward. Here's the rest of it:

f = open('genome.fa')
seqrecords = list(SeqIO.parse(f, 'fasta'))

rec = seqrecords[0]

seq = str(rec.seq)
comp_seq = str(rec.seq.reverse_complement())

start = ["ATG"]
stop = ["TAA","TAG","TGA"]

length_orf, orfs = getORF(seq, 25, start, stop)
length_orf_complement, orfs_complement = getComplementORF(comp_seq, 25, start, stop)

Do this once for each strand (initial and reverse_complement) then if you sort the output using the following command to give you the 15 longest:

python orf.py | sort -k1,1n | tail -15

the output is:

155 [E]
157 [F]
158 [I]
163 [L]
166 [F]
167 [Q]
170 [T]
171 [E]
173 [R]
175 [C]
176 [E]
187 [S]
201 [E]
211 [H]
234 [T]

This is the correct phrase. The output using the straight up translate is:

TPKSSSILLRPCQCVSDRKHVTRTAYNFFI...KLA - length 178, strand 1, frame 2 [T]
EAQVRFPVFSSDCPLMMLFSRRLLIGLVRR...GRD - length 181, strand -1, frame 0 [L]
KSGGSTREFRGMSVPEAVRFLKILGNICEQ...RNS - length 181, strand 1, frame 2 [L]
YSLGQQGPEGGVSFEVIAVVVHPKTERGSR...TLA - length 181, strand 1, frame 2 [K]
TVDFQRHSLIVVVARNHLLSTRVQAGLSRD...SWG - length 183, strand -1, frame 0 [Q]
RGRLPDYKTTRACAENTIELRFPPSVYISE...TSN - length 185, strand 1, frame 0 [P]
YYSRLKETPPTQPNPAIMGRRSDETALTRQ...RSF - length 191, strand -1, frame 0 [E]
GPYLWFSCLARGTCKTGDIDYRNSSVVDPY...RPT - length 199, strand 1, frame 0 [S]
LHNQQAQECDDFCMRCRHEVSYSLLNKDGF...LIM - length 199, strand 1, frame 0 [L]
PWLHWESSLGNIFTLRPWVHGFYKEPGCNK...CLF - length 199, strand -1, frame 1 [K]
TQPVQFGLYLTHMAGVGTTREGLTQGLMLY...WHI - length 212, strand -1, frame 0 [T]
VSMVANTFIPLSMGCRYITHSICVSRHMRY...LPV - length 212, strand 1, frame 1 [V]
AVWTSGIELAVQQGTRDVILKNGRQIREVS...QSL - length 219, strand 1, frame 0 [R]
ELTRLDITVLSLCNVSPRNVYGINNASASQ...TIR - length 223, strand -1, frame 0 [N]
TRSKSNGLSMEDNRPLFALRRYWDTTSGSS...KGW - length 242, strand -1, frame 2 [D]

You can see that the lengths don't even match up. What gives here? Am I missing something?

4

1 回答 1

1

部分问题是“直接翻译”没有考虑起始密码子。
它只是在每一帧中翻译并在“*”(终止密码子)上分割。
尝试在每个序列中查找第一个“M”(= ATG 的翻译)并从该点开始序列...

于 2013-04-30T00:18:28.917 回答