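Hi,
I am currently building a website that aims to bring all papillomavirus information together in one place. As part of this effort, we are curating all known files from public servers (e.g. GenBank). One problem I have run into is that many of the solved structures (about 50%) are not numbered according to the protein. That is, a subdomain was crystallized (amino acids 310-450), but the crystallographer deposited it as residues 1-140. I was wondering whether anyone knows of a way to renumber an entire PDB file. I have found ways to renumber the sequence (as identified by SEQRES), but that does not update the HELIX and SHEET information. I would appreciate any suggestions.
Thanks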
Viewed 4928 times
3 answers
7
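I am the maintainer of pdb-tools, which may be a tool that can help you.
I recently modified the residue-renumber script in my application to give it more flexibility. It can now renumber HETATM records and specific chains, and it can either force the residue numbering to be contiguous or simply add a user-specified offset to all residues.
Let me know if this helps.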
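For readers who want to see what an offset renumbering involves, here is a minimal Python sketch. It is not the pdb-tools implementation. It assumes standard fixed-column PDB records and shifts residue numbers in the coordinate records (ATOM, HETATM, ANISOU, TER) as well as in HELIX and SHEET, which is the part the question says gets missed. The names `shift_residue_numbers` and `FIELDS` are made up for illustration, and other records that carry residue numbers (for example SSBOND, LINK or SITE) are not handled.

```python
# Record name -> list of (chain column, resSeq start, resSeq end),
# 0-based and end-exclusive, following the fixed-column PDB format.
FIELDS = {
    "ATOM  ": [(21, 22, 26)],
    "HETATM": [(21, 22, 26)],
    "ANISOU": [(21, 22, 26)],
    "TER   ": [(21, 22, 26)],
    "HELIX ": [(19, 21, 25), (31, 33, 37)],
    "SHEET ": [(21, 22, 26), (32, 33, 37), (49, 50, 54), (64, 65, 69)],
}


def shift_residue_numbers(lines, offset, chain=None):
    """Return PDB lines (without trailing newlines) with residue numbers shifted.

    If `chain` is given, only fields belonging to that chain ID are changed.
    """
    out = []
    for line in lines:
        newline = line.rstrip("\n")
        record = newline[:6].ljust(6)
        for chain_col, start, end in FIELDS.get(record, []):
            field = newline[start:end]
            if not field.strip():
                continue  # optional field left blank, e.g. a bare TER line
            if chain is not None and newline[chain_col:chain_col + 1] != chain:
                continue
            new = "%4d" % (int(field) + offset)
            if len(new) > 4:
                raise ValueError("residue number does not fit in 4 columns: %s" % new)
            newline = newline[:start] + new + newline[end:]
        out.append(newline)
    return out
```

For the case in the question, where residues deposited as 1-140 should start at 310, the offset would be 309, e.g. `shift_residue_numbers(open("in.pdb"), 309, chain="A")` with `in.pdb` standing in for your own file.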
Answered 2012-12-04T22:12:41.830
1
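I run into this problem a lot too. After giving up on an old Perl script, I have been experimenting with some Python. This solution assumes you have Biopython, ProDy ( http://www.csb.pitt.edu/ProDy/#prody ) and EMBOSS ( http://emboss.sourceforge.net/ ) installed.
I used a papillomavirus PDB entry here.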
from Bio import AlignIO, SeqIO, ExPASy, SwissProt
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Bio.Alphabet was removed in Biopython 1.78, so Seq objects are built without an alphabet here.
# NeedleCommandline is deprecated in recent Biopython releases; if it is unavailable, run needle via subprocess instead.
from Bio.Emboss.Applications import NeedleCommandline
from prody.proteins.pdbfile import parsePDB, writePDB
import os

# Three-letter to one-letter codes for the 20 standard amino acids
oneletter = {
    'ASP':'D','GLU':'E','ASN':'N','GLN':'Q',
    'ARG':'R','LYS':'K','PRO':'P','GLY':'G',
    'CYS':'C','THR':'T','SER':'S','MET':'M',
    'TRP':'W','PHE':'F','TYR':'Y','HIS':'H',
    'ALA':'A','VAL':'V','LEU':'L','ILE':'I',
}

# Retrieve pdb to extract sequence
# Can probably be done with Bio.PDB but being able to use the vmd-like selection algebra is nice
pdbname = "2kpl"
selection = "chain A"
structure = parsePDB(pdbname)
pdbseq_str = ''.join([oneletter[i] for i in structure.select("protein and name CA and %s" % selection).getResnames()])
alnPDBseq = SeqRecord(Seq(pdbseq_str), id=pdbname)
SeqIO.write(alnPDBseq, "%s.fasta" % pdbname, "fasta")

# Retrieve reference sequence
accession = "Q96QZ7"
handle = ExPASy.get_sprot_raw(accession)
swissseq = SwissProt.read(handle)
refseq = SeqRecord(Seq(swissseq.sequence), id=accession)
SeqIO.write(refseq, "%s.fasta" % accession, "fasta")

# Do global alignment with needle from EMBOSS, stores entire sequences which makes numbering easier
needle_cli = NeedleCommandline(asequence="%s.fasta" % pdbname, bsequence="%s.fasta" % accession, gapopen=10, gapextend=0.5, outfile="needle.out")
needle_cli()
aln = AlignIO.read("needle.out", "emboss")
os.remove("needle.out")
os.remove("%s.fasta" % pdbname)
os.remove("%s.fasta" % accession)
alnPDBseq = aln[0]
alnREFseq = aln[1]

# Initialize per-letter annotation for pdb sequence record
alnPDBseq.letter_annotations["resnum"] = [None] * len(alnPDBseq)
# Initialize annotation for reference sequence, assume first residue is #1
# (numbers follow alignment columns, so this also assumes no gaps in the aligned reference)
alnREFseq.letter_annotations["resnum"] = list(range(1, len(alnREFseq) + 1))

# Set new residue numbers in alnPDBseq based on alignment
reslist = [[i, alnREFseq.letter_annotations["resnum"][i]] for i in range(len(alnREFseq)) if alnPDBseq[i] != '-']
for [i, r] in reslist:
    alnPDBseq.letter_annotations["resnum"][i] = r

# Set new residue numbers in the structure
newresnums = [i for i in alnPDBseq.letter_annotations["resnum"][:] if i is not None]
resindices = structure.select("protein and name CA and %s" % selection).getResindices()
resmatrix = [[newresnums[i], resindices[i]] for i in range(len(newresnums))]
for [newresnum, resindex] in resmatrix:
    structure.select("resindex %d" % resindex).setResnums(newresnum)
writePDB("%s.renumbered.pdb" % pdbname, structure)
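One caveat about the numbering step: it assigns reference numbers by alignment column, which only matches the true reference positions when the aligned reference sequence contains no gap characters. If the PDB chain carries extra residues (an expression tag, for instance), needle will put gaps in the reference and every number after them will be off. A minimal, gap-aware sketch of the mapping is below. The function name `residue_numbers_from_alignment` is hypothetical, and returning `None` for PDB residues that have no counterpart in the reference is just one possible convention.

```python
def residue_numbers_from_alignment(aligned_pdb, aligned_ref, first_ref_number=1):
    """Map each residue of the PDB sequence to its reference residue number.

    Both arguments are gapped strings of equal length taken from a pairwise
    alignment. The result has one entry per non-gap PDB residue; entries are
    None where the PDB residue is aligned to a gap in the reference.
    """
    if len(aligned_pdb) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    numbers = []
    ref_number = first_ref_number - 1
    for pdb_char, ref_char in zip(aligned_pdb, aligned_ref):
        if ref_char != "-":
            ref_number += 1  # only real reference residues advance the count
        if pdb_char == "-":
            continue  # no PDB residue in this column
        numbers.append(ref_number if ref_char != "-" else None)
    return numbers
```

With Biopython records such as `alnPDBseq` and `alnREFseq` above, the two strings would be `str(alnPDBseq.seq)` and `str(alnREFseq.seq)`. Residues mapped to `None` need a separate decision (skip them, or give them insertion codes) before calling `setResnums`.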
Answered 2013-07-21T22:32:29.110