Here is a small example:

X1 <- c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC")
X2 <- c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC")
X3 <- c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA")
mydf1 <- data.frame(X1, X2, X3)

Input data frame

  X1 X2 X3
1 AC AC AC
2 AC AC AC
3 AC AC AC
4 CA CA AC
5 TA AT AA
6 AT CA AT
7 CC AC CC
8 CC TC CA

The function

# Function
atgc <- function(x) {
  xlate <- c("AA" = 11, "AC" = 12, "AG" = 13, "AT" = 14,
             "CA" = 12, "CC" = 22, "CG" = 23, "CT" = 24,
             "GA" = 13, "GC" = 23, "GG" = 33, "GT" = 34,
             "TA" = 14, "TC" = 24, "TG" = 34, "TT" = 44,
             "ID" = 56, "DI" = 56, "DD" = 55, "II" = 66)
  x = xlate[x]
}
outdataframe <- sapply(mydf1, atgc)
outdataframe
   X1 X2 X3
AA 11 11 12
AA 11 11 12
AA 11 11 12
AG 13 13 12
CA 12 12 11
AC 12 13 13
AT 14 11 12
AT 14 14 14

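The problem: AC comes out as 11 in the output instead of 12, and the other values are wrong in the same way. It's a mess!

(Extra: I also don't know how to get rid of the row names.)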

4 Answers

Just use apply and transpose the result:

t(apply (mydf1, 1, atgc))
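
A minimal sketch of why this works, assuming the columns are factors: apply() first converts the data frame with as.matrix(), which yields a character matrix, so each row reaches the function as character labels. The two-column data frame, the shortened lookup table and the unname() call (one way to drop the dimnames asked about in the question) are illustrative assumptions, not part of the original answer.

```r
# Hypothetical two-row data frame with explicit factor columns
df <- data.frame(X1 = factor(c("AC", "CA")), X2 = factor(c("TA", "AC")))

# What apply() works on internally: a character matrix, not factors
m <- as.matrix(df)

# Shortened lookup table, for illustration only
xlate <- c("AC" = 12, "CA" = 12, "TA" = 14)

# Row-wise lookup, transposed back to the original orientation
res <- t(apply(df, 1, function(x) xlate[x]))

# unname() strips the dimnames picked up from the lookup table
res_plain <- unname(res)
```

Note that apply() loses the original column names here, so unname() leaves a plain numeric matrix.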

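To use sapply, the columns must be character vectors rather than factors (indexing with a factor uses its integer codes, not its labels), so either:

  1. Set stringsAsFactors=FALSE when creating the data frame, i.e.

    mydf1 <- data.frame(X1, X2, X3, stringsAsFactors=FALSE)

    (thanks to @joran), or

  2. Change the last line of the function to: x = xlate[as.vector(x)]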
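
The difference between the two kinds of indexing can be seen in a small sketch. The factor is built explicitly here because, since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the behaviour in the question only appears on older versions or with explicit factors. The shortened lookup table and the variable names are illustrative assumptions.

```r
# Shortened lookup table, for illustration only
xlate <- c("AA" = 11, "AC" = 12, "AG" = 13, "AT" = 14)

# Levels are sorted alphabetically: "AC" has code 1, "AT" has code 2
f <- factor(c("AC", "AT", "AC"))

# Indexing with the factor uses the integer codes 1, 2, 1,
# so it picks the 1st and 2nd entries of xlate ("AA" and "AC")
by_code <- xlate[f]

# Converting to character first looks the labels up by name
by_label <- xlate[as.character(f)]
```

Here by_code holds 11, 12, 11 (the wrong values seen in the question), while by_label holds 12, 14, 12.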

answered 2012-04-27T15:50:04.807

The match function accepts a factor argument when the target vector it matches against is of class "character":

atgc <- function(fac) {
  # Replacement values, in the same order as the keys passed to match() below
  c(11, 12, 13, 14,
    12, 22, 23, 24,
    13, 23, 33, 34,
    14, 24, 34, 44,
    56, 56, 55, 66)[
    match(fac,
          c("AA", "AC", "AG", "AT",
            "CA", "CC", "CG", "CT",
            "GA", "GC", "GG", "GT",
            "TA", "TC", "TG", "TT",
            "ID", "DI", "DD", "II"))
  ]
}
# The match function returns an index that is designed to pull from a vector.
sapply(mydf1, atgc)
     X1 X2 X3
[1,] 12 12 12
[2,] 12 12 12
[3,] 12 12 12
[4,] 12 12 12
[5,] 14 14 11
[6,] 14 12 14
[7,] 22 12 22
[8,] 22 24 12
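
A smaller sketch of the same idea, with hypothetical keys and values: match() compares a factor by its labels, so the index it returns points at the right replacement even when the factor's internal codes differ from the key positions.

```r
# Keys deliberately not in alphabetical order, values in the same order
keys <- c("TA", "CA", "AC")
vals <- c(14, 12, 12)

# Internal codes are AC = 1, CA = 2, TA = 3, so this factor has codes 2, 1, 3
f <- factor(c("CA", "AC", "TA"))

# match() uses the labels: "CA" is key 2, "AC" is key 3, "TA" is key 1
idx <- match(f, keys)

# Pull the replacement values by position
looked_up <- vals[idx]

# A label missing from the keys gives NA instead of a wrong value
missing_idx <- match("GG", keys)
```

Because vals is an unnamed vector, the looked-up result carries no names, which is why the sapply output above has no row names.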
answered 2012-04-27T16:22:19.970

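With this approach you only need to supply a replacement value for each letter in the matrix, without having to double-check that you have covered every combination and matched each one correctly (although in your example the set of combinations is limited).

Define a list of the values and their replacements: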

trans <- list(c("A","1"),c("C","2"),c("G","3"),c("T","4"),
  c("I","6"),c("D","5"))

Define a replacement function using gsub():

atgc2 <- function(myData, x) gsub(x[1], x[2], myData)

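Create a matrix holding the replaced values of mydf1 (in this case, converting to a matrix returns character values, as gsub() requires, but you should check that this holds for any other data before proceeding):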

mymat <- Reduce(atgc2, trans, init = as.matrix(mydf1))
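
To make the Reduce() step concrete, here is a minimal sketch on a single string with a shortened, hypothetical translation list. accumulate = TRUE is used only to expose the intermediate results; the answer itself does not use it.

```r
# Shortened translation list, for illustration only
trans <- list(c("A", "1"), c("C", "2"))
atgc2 <- function(myData, x) gsub(x[1], x[2], myData)

# Reduce() feeds the running result into atgc2 once per list element
via_reduce <- Reduce(atgc2, trans, init = "CA")

# The same computation written out by hand
by_hand <- atgc2(atgc2("CA", trans[[1]]), trans[[2]])

# Intermediate states: the start value, then one entry per substitution
steps <- Reduce(atgc2, trans, init = "CA", accumulate = TRUE)
```

The string goes from "CA" to "C1" to "21", so the letters are replaced in place and their original order is kept.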

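The values in mymat are still in the order in which the letters originally appeared, so "AC" = "12" but "CA" = "21"; reorder the digits (and convert the results to numeric):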

ansVec <- sapply( strsplit( mymat, split = ""),
  function(x) as.numeric( paste0( sort( as.numeric(x) ), collapse = "")))

The object ansVec is a vector, so convert it back to a data.frame:

( mydf2 <- data.frame( matrix( ansVec, nrow = nrow(mydf1) ) ) )
#   X1 X2 X3
# 1 12 12 12
# 2 12 12 12
# 3 12 12 12
# 4 12 12 12
# 5 14 14 11
# 6 14 12 14
# 7 22 12 22
# 8 22 24 12

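For this case the other answers are certainly faster. However, as the replacement operations become more complex, I think this solution may offer some benefits. One thing this method does not handle, though, is checking for "ATT" and "TTG" in the string "ATTGCG".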

answered 2012-04-27T22:01:55.440

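Actually, I think you want to represent the original vectors as factors, because they represent a finite set of levels (DNA dinucleotides) rather than arbitrary character values.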

lvls = c("AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT", "GA", "GC", 
         "GG", "GT", "TA", "TC", "TG", "TT", "ID", "DI", "DD", "II")
X1 <- factor(c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC"), levels=lvls)
X2 <- factor(c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC"), levels=lvls)
X3 <- factor(c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA"), levels=lvls)
mydf1 <- data.frame(X1, X2, X3)

Likewise, "11" is a level of a factor, not the number 11. So the mapping between levels is

xlate <- c("AA" = "11", "AC" = "12", "AG" = "13", "AT" = "14",
           "CA"= "12", "CC" = "22", "CG"= "23","CT"= "24",
           "GA" = "13", "GC" = "23", "GG"= "33","GT"= "34",
           "TA"= "14",  "TC" = "24", "TG"= "34","TT"="44",
           "ID"= "56", "DI"= "56", "DD"= "55", "II"= "66")

and to "re-level" a single variable

levels(X1) <- xlate

To re-level all columns of the data frame,

as.data.frame(lapply(mydf1, `levels<-`, xlate))
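
A short sketch of what the re-levelling does, using a hypothetical four-level subset: old levels that map to the same new label are merged into one level. If actual numbers are wanted afterwards, converting through as.character() is a common idiom; that last step is an addition here, not part of the answer.

```r
# Hypothetical subset of the levels and their mapping
lvls <- c("AA", "AC", "CA", "CC")
xlate <- c("AA" = "11", "AC" = "12", "CA" = "12", "CC" = "22")

f <- factor(c("AC", "CA", "CC", "AA"), levels = lvls)

# Assign new labels position by position; "AC" and "CA" both become "12"
levels(f) <- xlate
merged_levels <- levels(f)

# Go through character to get the labels as numbers, not the internal codes
as_number <- as.numeric(as.character(f))
```

After the assignment the factor has three levels ("11", "12", "22") instead of four.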

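Using sapply is not appropriate, because it creates a matrix (of character), even though you have named it outdataframe. The distinction may really matter for the SNP data this probably represents: millions of SNPs across 1000 samples stored as a matrix would be realized as a single vector longer than the longest vector R can store (modulo the large vector support being introduced in R-devel), whereas a data frame would be a list of vectors, each with only millions of elements.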
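
The size argument can be checked with some quick arithmetic. The figure of 3 million SNPs is an assumption chosen for illustration; the limit shown is the classic maximum vector length of 2^31 - 1 elements, which applied before long vector support arrived in R 3.0.0.

```r
n_snps <- 3e6       # assumed: 3 million SNPs per sample
n_samples <- 1000

# A matrix is a single vector with a dim attribute, so it needs this many elements
total_cells <- n_snps * n_samples

# Classic upper bound on vector length: 2^31 - 1
classic_limit <- .Machine$integer.max

# The matrix would not fit under the classic limit ...
exceeds_as_matrix <- total_cells > classic_limit

# ... but each data frame column (one vector per sample) easily does
exceeds_per_column <- n_snps > classic_limit
```

Under these assumed sizes the matrix needs 3e9 elements, more than the 2147483647 allowed, while each column needs only 3e6.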

answered 2012-04-28T00:53:43.337