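I have a program that asks the user for a FASTA file containing protein sequences, then scans the sequences for a four-letter motif with the following rule: an "N", followed by anything except "P", followed by "S" or "T", followed by anything except "P". The part that reports an error when the file cannot be found works. However, when scanning the sequences, each result I get back is only a single letter.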
Here is my code:
import re
userinput = input("Please provide a FASTA file.")
while userinput:
    try:
        if userinput == "0":
            break
        with open(userinput, mode = 'r') as protein:
            readprotein = protein.read()
            matches = re.findall('N[^P](S|T)[^P]', readprotein)
            for match in matches:
                print(match)
        break
    except FileNotFoundError:
        print("File not found. Please ensure that you are typing the file name exactly as it is found with the file extension.")
        userinput = input("Please provide a FASTA file. 0 to quit.")
The FASTA file I'm using is the HIV type 2 proteome. Here is a small excerpt:
>sp|P18096|POL_HV2BE Gag-Pol polyprotein OS=Human immunodeficiency virus type 2 subtype A (isolate BEN) OX=11714 GN=gag-pol PE=3 SV=4
MGARNSVLRGKKADELEKVRLRPGGKKKYRLKHIVWAANELDKFGLAESLLESKEGCQKI
LRVLDPLVPTGSENLKSLFNTVCVIWCLHAEEKVKDTEEAKKLAQRHLVAETGTAEKMPN
TSRPTAPPSGKRGNYPVQQAGGNYVHVPLSPRTLNAWVKLVEEKKFGAEVVPGFQALSEG
CTPYDINQMLNCVGDHQAAMQIIREIINEEAADWDSQHPIPGPLPAGQLRDPRGSDIAGT
TSTVDEQIQWMYRPQNPVPVGNIYRRWIQIGLQKCVRKYNPTNILDIKQGPKEPFQSYVD
RFYKSLRAEQTDPAVKNWMTQTLLIQNANPDCKLVLKGLGMNPTLEEMLTACQGVGGPGQ
KARLMAEALKEAMGPSPIPFAAAQQRKAIRYWNCGKEGHSARQCRAPRRQGCWKCGKPGH
IMANCPERQAGFFRVGPTGKEASQLPRDPSPSGADTNSTSGRSSSGTVGEIYAAREKAEG
AEGETIQRGDGGLAAPRAERDTSQRGDRGLAAPQFSLWKRPVVTAYIEDQPVEVLLDTGA
DDSIVAGIELGDNYTPKIVGGIGGFINTKEYKNVEIKVLNKRVRATIMTGDTPINIFGRN
ILTALGMSLNLPVAKIEPIKVTLKPGKDGPRLKQWPLTKEKIEALKEICEKMEKEGQLEE
APPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEIQLGIPHPAGLAKKKRISIL
DVGDAYFSIPLHEDFRQYTAFTLPAVNNMEPGKRYIYKVLPQGWKGSPAIFQYTMRQVLE
PFRKANPDVILIQYMDDILIASDRTGLEHDKVVLQLKELLNGLGFSTPDEKFQKDPPFQW
MGCELWPTKWKLQKLQLPQKDIWTVNDIQKLVGVLNWAAQIYSGIKTKHLCRLIRGKMTL
TEEVQWTELAEAELEENKIILSQEQEGYYYQEEKELEATIQKSQGHQWTYKIHQEEKILK
VGKYAKIKNTHTNGVRLLAQVVQKIGKEALVIWGRIPKFHLPVERETWEQWWDNYWQVTW
IPEWDFVSTPPLVRLTFNLVGDPIPGAETFYTDGSCNRQSKEGKAGYVTDRGKDKVKVLE
QTTNQQAELEVFRMALADSGPKVNIIVDSQYVMGIVAGQPTESENRIVNQIIEEMIKKEA
VYVAWVPAHKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHSIIKELTHKFGIPL
LVARQIVNSCAQCQQKGEAIHGQVNAEIGVWQMDYTHLEGKIIIVAVHVASGFIEAEVIP
QESGRQTALFLLKLASRWPITHLHTDNGPNFTSQEVKMVAWWVGIEQSFGVPYNPQSQGV
VEAMNHHLKNQISRIREQANTIETIVLMAVHCMNFKRRGGIGDMTPAERLINMITTEQEI
QFLQRKNSNFKNFQVYYREGRDQLWKGPGELLWKGEGAVIVKVGTDIKVVPRRKAKIIRD
YGGRQELDSSPHLEGAREDGEMACPCQVPEIQNKRPRGGALCSPPQGGMGMVDLQQGNIP
TTRKKSSRNTGILEPNTRKRMALLSCSKINLVYRKVLDRCYPRLCRHPNT
>sp|P18095|GAG_HV2BE Gag polyprotein OS=Human immunodeficiency virus type 2 subtype A (isolate BEN) OX=11714 GN=gag PE=3 SV=3
MGARNSVLRGKKADELEKVRLRPGGKKKYRLKHIVWAANELDKFGLAESLLESKEGCQKI
LRVLDPLVPTGSENLKSLFNTVCVIWCLHAEEKVKDTEEAKKLAQRHLVAETGTAEKMPN
TSRPTAPPSGKRGNYPVQQAGGNYVHVPLSPRTLNAWVKLVEEKKFGAEVVPGFQALSEG
CTPYDINQMLNCVGDHQAAMQIIREIINEEAADWDSQHPIPGPLPAGQLRDPRGSDIAGT
TSTVDEQIQWMYRPQNPVPVGNIYRRWIQIGLQKCVRKYNPTNILDIKQGPKEPFQSYVD
RFYKSLRAEQTDPAVKNWMTQTLLIQNANPDCKLVLKGLGMNPTLEEMLTACQGVGGPGQ
KARLMAEALKEAMGPSPIPFAAAQQRKAIRYWNCGKEGHSARQCRAPRRQGCWKCGKPGH
IMANCPERQAGFLGLGPRGKKPRNFPVTQAPQGLIPTAPPADPAAELLERYMQQGRKQRE
QRERPYKEVTEDLLHLEQRETPHREETEDLLHLNSLFGKDQ
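Evidently the error is in the findall call my professor instructed me to use, and I think it may simply be that I don't fully understand how regular expressions work. What I have is re.findall('N[^P](S|T)[^P]', readprotein). I don't understand why the single-letter results I get don't even start with "N"; they are just a bunch of "T"s and "S"s. Any help is appreciated!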
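A likely explanation, offered as a sketch rather than a definitive fix: when a pattern contains exactly one capturing group, re.findall returns only the text captured by that group, not the whole match. Here the group is (S|T), which would account for the lone "S" and "T" results. Writing the alternative as a character class [ST] (or as a non-capturing group (?:S|T)) makes findall return the full four letters. Separately, because the whole file is read as one string, the header lines and the line breaks inside each sequence are scanned too, and a newline counts as "anything except P". The helper names read_fasta and find_motifs below are hypothetical and used only for illustration, as is the made-up "demo" record.

```python
import re

# Same rule as in the question, with a character class instead of a
# capturing group so that findall returns the whole four-letter match.
MOTIF = re.compile(r"N[^P][ST][^P]")


def read_fasta(text):
    """Return {header: sequence}, with the line breaks inside each sequence removed."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def find_motifs(text):
    """Return {header: list of motif matches} for every record in the text."""
    return {name: MOTIF.findall(seq) for name, seq in read_fasta(text).items()}


# Capturing group: findall returns only what the group captured.
print(re.findall(r"N[^P](S|T)[^P]", "GADTNSTSGRSS"))  # ['T']

# Character class: findall returns the whole match.
print(MOTIF.findall("GADTNSTSGRSS"))  # ['NSTS']

# A motif that straddles a line break is found once the lines are joined.
print(find_motifs(">demo\nAEKMPN\nTSRPTA\n"))  # {'demo': ['NTSR']}
```

Note that findall reports non-overlapping matches only, so if overlapping motifs matter for the assignment, a different approach would be needed.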