r - 转换/更新表中基因 ID 的最快方法？

Question

注意：我不是在问特定于 Bioconductor 的问题，而是在示例代码中需要 Bioconductor。请忍受我。

你好，

我有许多制表符分隔的文件，其中包含有关特定基因的各种类型的信息。一列或多列可以是我需要升级到最新基因符号注释的基因符号的别名。

我正在使用 Bioconductor 的 org.Hs.eg.db 库来执行此操作（特别是 org.Hs.egALIAS2EG 和 org.Hs.egSYMBOL 对象）。

报告的代码可以完成这项工作，但速度很慢，我猜是因为嵌套的 for 循环在每次迭代时查询 org.Hs.eg.db 数据库。是否有更快/更简单/更智能的方法来实现相同的结果？

library(org.Hs.eg.db)

myTable <- read.table("tab_delimited_file.txt", header=TRUE, sep="\t", as.is=TRUE)

for (i in 1:nrow(myTable)) {
    for (j in 1:ncol(myTable)) {
        repl <- org.Hs.egALIAS2EG[[myTable[i,j]]][1]
        if (!is.null(repl)) {
            repl <- org.Hs.egSYMBOL[[repl]][1]
            if (!is.null(repl)) {
                myTable[i,j] <- repl
            }
        }
    }
}

write.table(myTable, file="new_tab_delimited_file", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)

我正在考虑使用其中一个 apply 函数，但请记住 org.Hs.egALIAS2EG 和 org.Hs.egSYMBOL 是对象，而不是函数。

谢谢！

score 2 · Accepted Answer

使用该mget功能。

eg[i,] <- mget( myTable[i,],  org.Hs.egALIAS2EG )
symbol[i, ] <- mget( myTable[i,], org.Hs.egSYMBOL )

等等。这是它应该使用的方式，并且比任何其他替代方式都要快得多。但是，也许首先将 myTable 重塑为基因名称向量是值得的：

v <- unique( as.vector( as.matrix( myTable ) ) )
v <- v[ v %in% ls( org.Hs.egALIAS2EG ) ]
eg <- unlist( mget( v, org.Hs.egALIAS2EG ) )
symbol <- unlist( mget( eg, org.Hs.egSYMBOL ) )

等等。上面的第二行确保您只查找数据库中实际存在的符号。现在您可以使用符号表再次修改该表。这是一种可以做到的方法，假设并非 myTable 的所有元素都匹配。为简洁起见，我将表格复制到t：

t <- as.matrix( myTable )
names( symbol ) <- v
t[ !is.na( match( t, v ) ) ] <- symbol[ match( t, v ) ][ ! is.na( match( t, v ) ) ]

好的。那是假设我们正在处理一个（或多或少）字符矩阵。但是，坦率地说，您只有一个包含两列的数据框，因此无需像拥有数百列那样真正自动化代码。让我们写一个小函数。（如果我们可以假设表中的所有元素都可以在 org.Hs.egALIAS2EG 中找到，那就更简单了）

convert2symbol <- function( x ) {
  v <- unique( as.character( x ) )
  v <- v[ v %in% ls( org.Hs.egALIAS2EG ) ]
  eg <- unlist( mget( v, org.Hs.egALIAS2EG ) )
  symbol <- unlist( mget( eg, org.Hs.egSYMBOL ) )
  m <- match( x, v )
  v[ ! is.na( m ) ] <- symbol[ m ][ ! is.na( m ) ]
  v
}

现在你可以

myTable$LigandGene <- convert2symbol( myTable$LigandGene )

或者

newTable <- apply( myTable, 2, convert2symbol )

至于为什么 as.vector(data.frame) 不起作用： data.frame 不是矩阵。它是一个以花哨的方式显示并在[]其上定义了功能的列表。

score 1 · Accepted Answer

您可以使用 sapply 并命名几个不是向量的变量，例如org.Hs.eg.db库中的对象：

library(org.Hs.eg.db)
myTable <- read.table("tab_delimited_file.txt", header=TRUE, sep="\t", as.is=TRUE)

myfunc <- function(idx,mytab,a2e,es){
            i = idx %/% nrow(mytab) + 1
            j = idx %% ncol(mytab) + 1
            repl <- a2e[[myTable[i,j]]][1];
            if (!is.null(repl)) {
              repl <- es[[repl]][1]
              if (!is.null(repl)) {
                return(repl)
              }
            }
            else {return("NA")}
          }

vec <- 0:(ncol(myTable)*nrow(myTable)-1)
out <- sapply(vec,mytab=myTable,a2e=org.Hs.egALIAS2EG,es=org.Hs.egSYMBOL,myfunc)
myTable <- matrix(out, nrow=nrow(myTable),ncol=ncol(myTable),byrow=T)

score 1 · Accepted Answer

只是一个快速警告：一个别名可以映射到多个 Entrez 基因 ID。

因此，您当前的解决方案假定第一个列出的 ID 是正确的（这可能不是真的）。

# e.g. The alias "A1B" is assumed to map to "1" and not "6641"
mget("A1B", org.Hs.egALIAS2EG)
# $A1B
# [1] "1"    "6641"

如果您查看的帮助?org.Hs.egALIAS2EG，您会发现从不建议使用别名或符号作为主要基因标识符。

## From the 'Details' section of the help:
# Since gene symbols are sometimes redundantly assigned in the literature, 
# users are cautioned that this map may produce multiple matching results 
# for a single gene symbol. Users should map back from the entrez gene IDs 
# produced to determine which result is the one they want when this happens.

# Because of this problem with redundant assigment of gene symbols, 
# is it never advisable to use gene symbols as primary identifiers.

如果没有人工管理，就不可能知道哪个 ID 是“正确的”。因此，最安全的选择是获取表中每个别名的所有可能 ID 和符号，同时保留有关哪些是受体和哪些是配体的信息：

# your example subset with "A1B" and "trash" added for complexity
myTable <- data.frame(
    ReceptorGene = c("A1B", "ACVR2B", "ACVR2B", "ACVR2B", "ACVR2B", "AMHR2", "BLR1", "BMPR1A", "BMPR1A", "BMPR1A", "BMPR1A", "BMPR1A"),
    LigandGene = c("trash", "INHA", "INHBA", "INHBB", "INHBC", "AMH", "SCYB13", "BMP10", "BMP15", "BMP2", "BMP3", "BMP4"), 
    stringsAsFactors = FALSE
)

# unlist and rename
my.aliases <- unlist(myTable)
names(my.aliases) <- paste(names(my.aliases), my.aliases, sep = ".")

# determine which aliases have a corresponding Entrez Gene ID
has.key <- my.aliases %in% keys(org.Hs.egALIAS2EG)

# replace Aliases with character vectors of all possible entrez gene IDs 
my.aliases[has.key] <- sapply(my.aliases[has.key], function(x) {
    eg.ids <- unlist(mget(x, org.Hs.egALIAS2EG))
    symbols <- unlist(mget(eg.ids, org.Hs.egSYMBOL))
})

# my.aliases retains all pertinent information regarding the original alias
my.aliases[1:3]
# $ReceptorGene1.A1B
#       1    6641 
#  "A1BG" "SNTB1" 
# 
# $ReceptorGene2.ACVR2B
#       93 
# "ACVR2B" 
# 
# $ReceptorGene3.ACVR2B
#       93 
# "ACVR2B"

一旦您知道哪些 Entrez 基因 ID 是合适的，您就可以将它们作为附加列存储在您的表格中。

myTable$receptor.id <- c("1", "93", "93", "93", "93", "269", "643", "657", "657", "657", "657", "657") 
myTable$ligand.id   <- c(NA, "3623", "3624", "3625", "3626", "268", "10563", "27302", "9210", "650", "651", "652")

然后，当您需要更新到最新的符号时，您可以只使用 Entrez 基因 ID（永远不需要更新）。

has.key <- myTable$receptor.id %in% keys(org.Hs.egSYMBOL)
myTable$ReceptorGene[has.key] <- unlist(mget(myTable$receptor.id[has.key], org.Hs.egSYMBOL))

has.key <- myTable$ligand.id %in% keys(org.Hs.egSYMBOL)
myTable$LigandGene[has.key] <- unlist(mget(myTable$ligand.id[has.key], org.Hs.egSYMBOL))

head(myTable)
#   ReceptorGene LigandGene receptor.id ligand.id
# 1         A1BG      trash           1      <NA>
# 2       ACVR2B       INHA          93      3623
# 3       ACVR2B      INHBA          93      3624
# 4       ACVR2B      INHBB          93      3625
# 5       ACVR2B      INHBC          93      3626
# 6        AMHR2        AMH         269       268

score 0 · Accepted Answer

这是我能想到的最好的。

先写一个函数：

alias2GS <- function(x) {
    for (i in 1:length(x)) {
        if (!is.na(x[i])) {
            repl <- org.Hs.egALIAS2EG[[x[i]]][1]
            if (!is.null(repl)) {
                repl <- org.Hs.egSYMBOL[[repl]][1]
                if (!is.null(repl)) {
                    x[i] <- repl
                }
            }
        }
    }
    return(x)
}

然后为需要转换的数据框的每一列调用该函数，如下所示：

df$GeneSymbols <- alias2GS(df$GeneSymbols)

r - 转换/更新表中基因 ID 的最快方法？

4 回答 4

Related

Reference