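How can I use a regular expression in Python to capture the text between two strings or phrases and remove everything else on that line?
For example, below is a protein sequence that starts with a single-line header. How can I pull "CG33289-PC" out of that header, given that it comes after the phrase "FlyBase_Annotation_IDs:" and before the next comma ","?
I need to replace the header with this simplified result, "CG33289-PC", without disturbing the protein sequence (the all-uppercase text below the header line).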
This is what each protein sequence entry looks like, a header followed by a sequence:
>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..2);1926 ID=FBpp0293870;name=CG33289-PC;parent=FBgn0053289,FBtr0305327;dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC;MD5=478485a27487608aa2b6c35d39a3295c;length=405;release=r5.45;species=Dmel;
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN FSRAV
This is the desired output:
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN FSRAV
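For reference, here is a minimal sketch of the kind of thing I am after, not a tested solution. The names `ID_PATTERN`, `simplify_header` and `simplify_fasta` are hypothetical, and it assumes that header lines start with ">" and that the ID ends at the next comma, semicolon, or whitespace (in the sample above the ID is actually followed by a semicolon, so the pattern stops at either delimiter).

```python
import re

# Assumption: the annotation ID runs from the marker up to the next
# comma, semicolon, or whitespace character.
ID_PATTERN = re.compile(r"FlyBase_Annotation_IDs:([^,;\s]+)")


def simplify_header(line):
    """Return the annotation ID for a '>' header line; pass other lines through."""
    if not line.startswith(">"):
        return line  # sequence line: leave it untouched
    match = ID_PATTERN.search(line)
    # If a header carries no annotation ID, keep the header as it is.
    return match.group(1) if match else line


def simplify_fasta(lines):
    """Apply simplify_header to every line of an entry or file."""
    return [simplify_header(line) for line in lines]


# Shortened version of the header above, for illustration only.
sample = [
    ">FBpp0293870 type=protein;dbxref=FlyBase:FBpp0293870,"
    "FlyBase_Annotation_IDs:CG33289-PC;MD5=478485a27487608aa2b6c35d39a3295c;",
    "MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII",
]
print("\n".join(simplify_fasta(sample)))
```

When reading from a file, each line would need its trailing newline stripped first (for example with `line.rstrip("\n")`) before being passed to `simplify_header`.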