r - How to extract cells starting with a specific pattern located in different columns, using unix or R

Question

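I am new to unix and trying to learn the basics. I have a tab-delimited file. I want to extract the cells that start with the pattern "txGN=" and print them in a new column of the corresponding row. These cells sit in different columns, and the number of columns is not the same for every row. The value is present in most rows, but not all.

This is what the file looks like: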

chr1  880942  taPN=-1    taWT=3       txGN=SAMD11   txID=uc001abw   FUNC=nonsyn
chr1  894573  txDN=-3    txGN=NOC2L   txID=uc003    intronic
chr1  10626   txDN=-9    txID=uc2     txST=+

Thank you very much.

score 1 · Accepted Answer

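# Count the maximum number of fields on any line of the file
maxcol <- max(count.fields("D:/file.txt"))

# fill=TRUE pads short rows; col.names fixes the width at maxcol columns
x <- read.table("D:/file.txt",as.is=TRUE,fill=TRUE,col.names=1:maxcol)
x[x==""]<-NA
# Row/column positions of every cell that starts with "txGN="
indices<-which(substr(as.matrix(x),start=1,stop=5)=="txGN=",arr.ind=TRUE)

x<-cbind(x,NA)
# seq_len() keeps the loop from running when there are no matches
for(i in seq_len(nrow(indices))){
  # Copy the matching cell into the first empty (NA) cell of its row
  na1<-which(is.na(x[indices[i,1],]))[1]
  x[indices[i,1],na1]<-x[indices[i,1],indices[i,2]]
}
x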
    X1     X2      X3         X4          X5            X6          X7          NA
1 chr1 880942 taPN=-1     taWT=3 txGN=SAMD11 txID=uc001abw FUNC=nonsyn txGN=SAMD11
2 chr1 894573 txDN=-3 txGN=NOC2L  txID=uc003      intronic  txGN=NOC2L        <NA>
3 chr1  10626 txDN=-9   txID=uc2      txST=+          <NA>        <NA>        <NA>

#If you want to "remove" NA's:
x[is.na(x)]<-""

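Edit:

Here is a version that does not hold the whole data frame in R (to reduce memory requirements); instead it reads the file in chunks and appends the results to a new file: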

maxcol <- max(count.fields("D:/file.txt")) 
maxrow <- length(readLines("D:/file.txt")) 
# bit inefficient, we read the whole file to get the number of lines 

stepsize<-50 # how many lines are read at once
k<-0
while(TRUE){
  # Read the next chunk; col.names keeps every chunk maxcol columns wide
  if((k+1)*stepsize > maxrow){
    x <- read.table("D:/file.txt",as.is=TRUE,fill=TRUE,col.names=1:maxcol,
                    skip=k*stepsize,nrows=maxrow-k*stepsize+1)
  } else  x <- read.table("D:/file.txt",as.is=TRUE,fill=TRUE,
                          col.names=1:maxcol, skip=k*stepsize,nrows=stepsize)

  if(nrow(x)==0) break #end loop when finished
  x[x==""]<-NA
  indices<-which(substr(as.matrix(x),start=1,stop=5)=="txGN=",arr.ind=TRUE)
  x<-cbind(x,NA)
  # seq_len() skips the loop when a chunk has no "txGN=" cells
  for(i in seq_len(nrow(indices))){
    na1<-which(is.na(x[indices[i,1],]))[1]
    x[indices[i,1],na1]<-x[indices[i,1],indices[i,2]]
  }
  # New stuff, change sep and eol if needed
  write.table(x, file = "D:/filenew.txt", append = TRUE, quote = FALSE, 
          sep = " ", eol = "\n", na = "",row.names = FALSE, col.names = FALSE)
  k<-k+1
}

read.table("D:/filenew.txt",as.is=TRUE,fill=TRUE,col.names=1:(maxcol+1))
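
If memory is the main concern, a line-by-line tool avoids the chunking logic altogether, since it never holds more than one row at a time. The following is a minimal sketch in awk, not part of the answer's R code. It assumes tab-separated input and appends the first cell that starts with "txGN=" as a new last column (an empty cell when the row has none). The function name add_txgn_column and the file names are hypothetical.

```shell
# Hypothetical helper: stream the input once and append the first cell that
# starts with "txGN=" as a new trailing column (empty if the row has none).
add_txgn_column() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    hit = ""
    for (i = 1; i <= NF; i++) {
      # index() == 1 means the cell begins with the literal prefix
      if (index($i, "txGN=") == 1) { hit = $i; break }
    }
    print $0, hit
  }' "$@"
}

# Usage (placeholder file names): add_txgn_column file.txt > filenew.txt
```

Unlike the R loop, which writes into the first empty cell of each row, this sketch always adds exactly one column at the end, so the position of the new cell varies with the row's original length.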

score 0 · Accepted Answer

It depends on what you mean by "unix", but if that includes commands commonly found on unix-based systems, how about a simple Perl script? Apply the following to your file

perl -ne 'print /txGN=([^\s]+)/ ? "$1\t$_" : "\t$_";' your-file

to get

SAMD11  chr1    880942  taPN=-1     taWT=3       txGN=SAMD11
NOC2L   chr1    894573  txDN=-655   txGN=NOC2L   txID=uc001aby.3
        chr1    1062638 txDN=-9758  txID=uc2     txST=+

A small rewrite would let you put the new column elsewhere.
