r - 从蛋白质数据库到 Cosmic 或 Uniprot 的蛋白质序列比对

Question

我想将蛋白质数据库中的 PDB 文件与 Cosmic 或 Uniprot 中显示的蛋白质的规范 AA 序列进行匹配。具体来说，我需要做的是从 pdb 文件中提取主链中的碳 α 原子及其 xyz 位置。我还需要在蛋白质序列中提取它们的实际顺序。对于结构 3GFT（Kras - Uniprot 登录号 P01116），这很简单，我可以取 ResSeq 号。然而，对于其他一些蛋白质，我不知道这是怎么可能的。

例如，对于结构 (2ZHQ)（蛋白质 F2 - Uniprot 登录号 P00734），Seqres 的 ResSeq 编号对于数字“1”和“14”重复，并且仅在 Icode 条目中有所不同。此外，icode 条目不是按字典顺序排列的，因此很难判断提取的顺序。

如果考虑结构 3V5Q（Uniprot 登录号 Q16288），情况会更糟。对于大多数蛋白质，ResSeq 编号与来自 COSMIC 或 UNIPROT 等来源的实际氨基酸相匹配。但是，在位置 711 之后，它会跳转到位置 730。当查看 REMARK 465（丢失的原子）时，它表明对于链 A，726-729 丢失了。然而，在将其与蛋白质匹配后，这些 AA 实际上是 712-715。

我附上了适用于简单 3GFT 示例的代码，但如果有人是 pdb 文件方面的专家并且可以帮助我弄清楚其余部分，我将非常感激。

library(gdata)

#answer<- get.positions("http://www.pdb.org/pdb/files/2ZHQ.pdb","L")
answer<- get.positions("http://www.pdb.org/pdb/files/3GFT.pdb","A")


#This function reads a pdb file and returns the appropriate data structure
get.positions <- function(sourcefile, chain_required = "A"){
N <- 10^5
AACount <- 0
positions = data.frame(Residue=rep(NA, N),AtomCount=rep(0, N),SideChain=rep(NA, N),XCoord=rep(0, N),YCoord=rep(0, N),ZCoord=rep(0, N),stringsAsFactors=FALSE)     

visited = list()
filedata <- readLines(sourcefile, n= -1)
for(i in 1: length(filedata)){
input = filedata[i]
id = substr(input,1,4)
if(id == "ATOM"){
  type = substr(input,14,15)
  if(type == "CA"){
    resSerial = strtoi(substr(input, 7,11))
    residue = substr(input,18,20)
    type_of_chain = substr(input,22,22)
    resSeq = strtoi(substr(input, 23,26))
    altLoc = substr(input,17,17)

    if(resSeq >=1){ #does not include negative residues
      if(type_of_chain == chain_required && !(resSerial %in% visited)  && (altLoc == " " || altLoc == "A") ){
        visited <- c(visited, resSerial)
        AACount <- AACount + 1
        position_string =list()
        position_string[[1]]= as.numeric(substr(input,31,38))
        position_string[[2]] =as.numeric(substr(input,39,46))
        position_string[[3]] =as.numeric(substr(input,47,54))
        #print(input)
        positions[AACount,]<- c(residue, resSeq, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]])
      }
    }
  }
}

  } 
  positions<-positions[1:AACount,]
  positions[,2]<- as.numeric(positions[,2])
  positions[,4]<- as.numeric(positions[,4])
  positions[,5]<- as.numeric(positions[,5])
  positions[,6]<- as.numeric(positions[,6])
  return (positions)
}

score 1 · Accepted Answer

这是一种方法。它需要安装 bio3d R 包 ( http://thegrantlab.org/bio3d/ ) 和肌肉对齐可执行文件。我按照此处http://thegrantlab.org/bio3d/tutorials/installing-bio3d的“附加实用程序”说明进行操作

下面的示例代码似乎适用于您列出的三种情况。

library(bio3d) 

## Download PDB file with given 'id' (can also read from online)
id <-  "3GFT" #"3V5Q"
file.name <- get.pdb(id)
pdb <- read.pdb(file.name)
pdb.seq <- pdbseq(pdb, atom.select(pdb, chain="A", elety="CA"))

## Get UniProt identifier and sequence
pdb.ano <- pdb.annotate(id, "db_id")
uni.seq <- get.seq(pdb.ano)

## Align sequences to define corespondences
aln <- seqaln( seqbind( pdb.seq, uni.seq), id=c(file.name, unlist(pdb.ano)) )

## Read aligned coordinate data (all the info you want is in here)
pdbs <- read.fasta.pdb(aln)

answer2 <- cbind( 1:ncol(pdbs$ali), t(pdbs$ali), 
            pdbs$resno[1,], matrix(pdbs$xyz[1,], ncol=3, byrow=T) ) 

head(answer2)

[1,] "1" "M"        "M"    "1" "62.935" "97.579" "30.223"
[2,] "2" "T"        "T"    "2" "63.155" "95.525" "27.079"
[3,] "3" "E"        "E"    "3" "65.289" "96.895" "24.308"
[4,] "4" "Y"        "Y"    "4" "64.899" "96.22"  "20.615"
[5,] "5" "K"        "K"    "5" "67.593" "96.715" "18.023"
[6,] "6" "L"        "L"    "6" "65.898" "97.863" "14.816"

如果您希望您的氨基酸以 3 个字母代码列出，那么 bio3d 中有一个 aa321() 函数。

score 1 · Accepted Answer

你可能想把这个问题移到 www.biostars.org 并写信给 help@uniprot.org（你知道这些序列已经在数据库级别链接了对吗？）无论如何写信给 help@uniprot.org 时问Jules Jacobsen 是 UniProt 的常驻专家，负责将 PDB 结构与 uniprot.org 规范序列联系起来。

r - 从蛋白质数据库到 Cosmic 或 Uniprot 的蛋白质序列比对

2 回答 2

Related

Reference