我一直在尝试使用 MAFFT 命令行工具来识别基因组中的编码区域。我的一般过程是将基因的氨基酸共有序列与目标序列的翻译阅读框对齐。我的方法在很大程度上是成功的。但是,我注意到一些奇怪的对齐方式会阻碍我的注释方法。以下是一个这样的示例(注意 - 我还包含了来自 Pairwise2 Biopython 模块的成对对齐以展示我想要的输出。不幸的是,Pairwise2 的计算时间比 MAFFT 命令行慢了近 20 倍):
from time import *
from Bio.SubsMat import MatrixInfo as matlist
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio.Align.Applications import MafftCommandline
startTime = time()
sample_tList = [['>Frame 1', 'RIGVGSIPRHLYCQELPLAQPKTCCAETPFRDSPLQGRLGVCPHLASGVALLYGLSTPLTMSGILDRCTCTPNARVFMAEGQVYCTRCLSARSLLPLNLQVPELGVLGLFYRPEEPLRWTLPRAFPTVECSPAGACWLSAIFPIARMTSGNLNFQQRMVRVAAEIYRAGQLTPAVLKVLQVYERGCRWYPIVGPVPGVGVYANSLHVSDKPFPGATHVLTNLPLPQRPKPEDFCPFECAMADVYDIGHGAVMFVAGGKVSWAPRGGDEVRFETVPEELKLIANRLHISFPPHHLVDMSKFAFIVPGSGVSLRVEHQHGCLPADIVPKGNCWWCLFDLLPPGVQNREIRYANQFGYQTKHGVSGKYLQRRLQINGLRAVTDTHGPIVVQYFSVKESWIRHFRLAGEPSLPGFEDLLRIRVESNTSPLADKDEKIFRFGSHKWYGAGKRARKARSGATTTVAHRASSARETRQAKKHEGVDANNAAHLEHYSPPAEGNCGWHCISAIVNRMVNSNFETTLPERVRPSDDWATDEDFVNTIQILRLPAALDRNGACKSAKYVLKLEGEHWTVSVAPGMSPSLLPLECVQGCCEHKGGLGSPDAVEVSGFDPTCLDRLAEVMHLPSSVIPAALAEMSNNSDRPASLVNTAWTVSQFYARHTGGNHRDQVRLGKIISLCQVIEECCCHQNKTNRATPEEVAAKIDQYLRGATSLEECLIKLERVSPPSAADTSFDWNVVLPGVEAAGPTTEQPHANQCCAPVPVVTQEPLDKDSVPLTAFSLSNCYYPAQGDEVRHRERLNSVLSKLEEVVLEEYGLMPTGLGPRPVLPSGLDELKDQMEEDLLKLANAQATSEMMALAAEQVDLKAWVKSYPRWIPPPPPPKVQPRRMKPVKSLPENKPVPAPRRKVRSDPGKSILAVGGPLNFSTPSELVTPLGEPVLMPASQHVSRPVTPLSEPAPVPAPRRIVSRPMTPLSEPTFVFAPWRKSQQVEEANPAAATLTCQDEPLDLSASSQTEYEAYPLAPLENIGVLEAGGQEAEEVLSGISDILDNTNPAPVSSSSSLSSVKITRPKYSAQAIIDSGGPCSGHLQKEKEACLRIMREACDAARLGDPATQEWLSHMWDRVDVLTWRNTSVYQAFRTLDGRFGFLPKMILETPPPYPCGFVMLPHTPTPSVSAESDLTIGSVATEDVPRILGKTENTGNVLNQKPLALFEEEPVCDQPAKDSRTLSRESGDSTTAPPVGTGGAGLPTDLPPLDGVDADGGGLLRTAKGKAERFFDQLSRQVFNIVSHLPVFFSHLFKSDSGYSPGDWGFAAFTLFCLFLCYSYPFFGFAPLLGVFSGSSRRVRMGVFGCWLAFAVGLFKPVSDPVGAACEFDSPECRNILHSFELLKPWDPVRSLVVGPVGLGLAILGRLLGGARYIWHFLLRLGIVADCILAGAYVLSQGRCKKCWGSCIRTAPNEIAFNVFPFTRATRSSLIDLCDRFCAPKGMDPIFLATGWRGCWTGQSPIEQPSEKPIAFAQLDEKRITARTVVSQPYDPNQAVKCLRVLQAGGAMVAEAVPKVVKVSAIPFRAPFFPTGVKVDPECRIVVDPDTFTTALRSGYSTTNLVLGVGDFAQLNGLKIRQISKPSGGGPHLIAALHVACSMVLHMLAGVYVTAVGSCGTGTSDPWCANPFAVPGYGPGSLCTSRLCISQHGLTLPLTALVAGFGLQEIALVVLIFVSIGGMAHRLSCKADMLCILLAIASYVWVPLTWLLCVFPCWLRWFSLHPLTILWLVFFLISVNMPSGILAVVLLVSLWLLGRYTNIAGLVTPYDIHHYTSGPRGVAALATAPDGTYLAAVRRAALTGRTMLFTPSQLGSLLEGAFRTRKPSLNTVNVVGSSMGSGGVFTIDGRIKCVTAAHVLTGNSARVSGVGFNQMLDFDVKGDFAIADCPNWQGVAPKTQFCGDGWTGRAYWLTSSGVEPGVIGDGFAFCFTACGDSGSPVITEAGELVGVHTGSNKQGGGIVTRPSGQFCNVTPIKLSELSEFFAGPKVPLGDVKVGSHIIKDTSEVPSDLCALLAAKPELEGGLSTVQLLCVFFLLWRMMGHAWTPLVAVGFFILNEVLPAVLVRSVFSFGMFALSWLTPWSAQVLMIRLLTAALNRNRVSLIFYSLGAVTGFVADLATTQGHPLQAVMNLSTYAFLPRMMVVTSPVPAIACGVVHLLAIILYLFKYRCLHHVLVGDGAFSAAFFLRYFAEGKLREGVSQSCGMSHESLTGALAIKLSDEDLDFLTKWTDFKCFVSASNMRNAAGQFIEAAYAKALRIELAQLVQVDKVRGTLAKLEAFADTVAPQLSPGDIVVALGHTPVGSIFDLKVGSTKHTLQAIETRVLAGSKMTVARVVDPTPAPPPAPVPIPLPPKVLENGPNAWGGEDRLNKRKRRRMEAVGIFVMDGKKYQKFWDKNSGDVFYEEVHNSTDEWECLRAGDPADFDPETGIQCGHVTIEDKVYNVFTSPSGRRFLVPANPENRRIQWEAARLSVEQALGMMNVDGELTAKELEKLKRIIDKLQGLTKEQCLNCPPVAPAVVAAAWLLLRQRKNFTTGPSPDLTKWPVRLSRTRSSTTNIRLPNRLMVVLCSCAPLFLRLMSSPALMHLLSYLPATGRETLGLMARFGILRPRPPKRKSHLVRKYRLVTLGAVTHLKLVSLISCTLLGATLSGKEFYRIQGLETYLTEPPVTLEAQCMRLPASRPMLLRLMGVPSWPQPCPPVLSCMYRPFQRPSLIILILGLTALNSQSTVVRMLLGTSPNTICPPKALFCLEFFALCGSTCLPMWVSARPFIGLPLTLPRILWLEMGTDFQPRIFRASLKSTFCAHRLCEKTGKLLLLVPSRSSIVGRRRLGQYLALITLRWPTGQRVVLPRASKRHSTRPSPSEKTNLRNYILQFAGALKLILHPAIDPHLQLSAGSLPIFFMNSPVLKSIYRRTCLTAVTTYWLRSPARLREAACRLATRLPPCQTPFTAYMHSTWCSVTLKVVTLMAFCFCKTSSLRTCSRFNPSSIQTTSCCMPSLPPCQITTGGLNITLCVSKRTQRRQPQTRHHFVAGMGVSSLTVTGFLRPSPTIRQAMSLNTTPRRLQYLWTAVLVSMILSGLKSSWLVRSAPARTVTASQARRSSCPCGKNSGPIMKGRSPECAGTAEPRLRTPLPVASTSVLTTPISTSIVLSSGVATRRVLALVVSVNLPWEKAQVLWMRCNKSRISLRGLSCMWSRVSPLLTQVDTKLAADSPLGVASGETKLTCQTVIMPVPPCSPLVKRSTWSLSPPTCCAAGSSSVPPALGKHTGSSNRSRMVMSFTRQLTRPCLTLGLWGCAGSTSQRVRRCNSLPPLVPARGFASWPAVGVLVRIPFWTKQRIAITLMSGFLAKPPLPAEISNNSTRWVLTLIAMFLTSCLRPNRPSGDSDRISVMPSNQITGTNLCPWSTQPVPRWTNLSGMGKSSPPTTGTERTAPSLSTPVKVPHLMWLHCICPLKIHSTGNEPLLLSPGQDMQSSCMTHTGNCRACLIFLRKAHPSTSQCSVTSSSYIEITKNARLLRLAMEINSGLQTSALILSAPFVQIWKGRAPRSPKLHITWGSISHLIHSLLNSQQNSHPTGPWQPRTMKSGLIGWLPAFAPSINIAARALVQAIWWAPRCFAPQGLCHTTSQNLLGARLKCFLRQSSAPAELRIAGSTSMIGSEKLLSPSHMPSLATSKALPVGDVITSPPDTFRASFLRNQLRSGFLAPEKLQRQFAHQMCTSQILKRTSTQRPSPSAGKCWILEKSDWSGKTRRPIFNLKAAISPGINLQATPHTSEFLLILQCIWTPAWALPFATGGLLGPPIGELTSRSPLMITVPKSFCLVHTMVKCLQGTKFWRARSSRLTTQGTNTLGDLNRIQRICTSLLGMVRTGRIIMKRFGRARKGKFIRLLPPASFIFPRALSLNQLATEMKWGLCRASLTKLVNFLWMLSRNFWCPLLISSYFWPFCLASPSPAGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGILWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISNLDVVAHFQHLAAIEAETYKYLASRLPMLHNLRMTGSNVTIVYNSTLNQVFAIFPTSGSRPRLHDSQQWLIAVHSSIFSSVVASCTLFVVLWLRIPMLRSVFGFRWLGAIFLLNSRITRCVRLASPGRPLLRSMNPVGLFGAGGMTDAVRTTMTNGSWFRLASAKATPVFTPGWRSCHSATRPSSIPRYLGGTVKFMLTSRTNSFAPSTTGRTPPCLAMTTFQPYFRPTTNIRSTAVIGFTNGCAPSFPLGWFMFRGFSGVRLQAMFQFKSFRHQDQHYRSIRLCCPPGHQLPVWRLAPSDGSQELSVPHGDRDTRVHHHHSQCHRELFTFFSPHAFLLPFLCFDEKGIQSGIWQCVRHRGCVCLYQLRPTCQGVHPTLLGSRSCATASFHDTDHEVGNRFSLSFCHPTGNLNVQVCWGNAPRAVTRNCFLCGVSCRSVLLCSSTPAATAALIFSFITRYVSMAQIGWQKDLTGQWRLLSFFLCLTLFPMEHSPPAIFLTRLVSLCPPPGSITGGMSVVSMRSVLWLRFASSLGLRRTACPGATLVLDTPTSFWTLRADSIVGGRPLLRKGVRLKSRVTSTSKELCLMVPWQPLPEFQRNNGVVSRRLLPHGSTKGAFGVFHYLYASDDICSKGKSRPTARASAPFDLPELCFYLRVHDIRALSEHKGRAHYGGSSCTSLGGVLSHRNLEIHHLQMPFVLARPQVHSGPCPPRRKCRGLSSDCGKPRICRPASRLHYGRHIGARVEKPRVGWQKSCTGSGKPCQICQITTASSKRERRGTASQSISCARCWVRSSPNKTSPEARDRGRKIIREARRSPIFLRLKKMSGTTSPLVSGNCVCRRSRLPLTRAPGHVPCQIQGGVTLWSLVCRRIILCASASQHHPQHDELAFFGHLGVMIGRMCGEWHLTLCLVTYSIRATVWGSLIGENHAAAIKKKKKKKK'], ['>ORF2_GP2', 'MKWGLCKASLTKLANFLWMLSRSFWCPLLISSYFWPFCLASQSPVGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGVLWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISGLDVVAHFQHLAAIEAETCKYLASRLPMLHNLRLTGSNVTIVYNSTLDQVFAIFPTPGSRPKLHDFQQWLIAVHSSIFSSVAASCTLFVVLWLRIPMLRSVFGFRWLGATFLLNSW']]
ex_file = open("newTempFile112233.fasta", "w")
for items in sample_tList:
ex_file.write(items[0] + "\n")
ex_file.write(items[1] + "\n")
ex_file.close()
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
#mafft_cline = MafftCommandline(mafft_exe, input=in_file, localpair=True, lexp=-1.5, lop=0.5)
stdout, stderr = mafft_cline()
print(stdout)
test_align = AlignIO.read(io.StringIO(stdout), "fasta")
#print(test_align)
os.remove("newTempFile112233.fasta")
print('Total time = ' + str(time() - startTime))
startTime = time()
matrix = matlist.blosum62
pWise_align = pairwise2.align.localds(sample_tList[0][1], sample_tList[1][1], matrix, -6, -1)
print(format_alignment(*pWise_align[0]))
print('Total time = ' + str(time() - startTime))
我试图通过参考帮助文档(http://mafft.cbrc.jp/al ignment/software/manual/manual.html )来更改 MAFFT 命令行对齐算法。我没有收到任何错误消息,但对齐输出没有改变。我不确定需要进行哪些调整。我相信通过增加间隙扩展惩罚(默认为零),对齐将得到改善。在此论坛上或通过 Google 搜索使用 MAFFT 命令行时,我无法找到许多使用自定义变量的文档示例。非常感谢您的帮助。作为参考,可以在此处找到有关 Pairwise2 对齐参数的文档:http: //biopython.org/DIST/docs/api/Bio.pairwise2-module.html