我正在寻找一种更快的方法来计算从 FASTA 文件中读取的 DNA 字符串的 GC 内容。这归结为获取一个字符串并计算字母“G”或“C”出现的次数。我还想指定要考虑的字符范围。
我有一个相当慢的工作函数,它导致我的代码出现瓶颈。它看起来像这样:
##
## count the number of GCs in the characters between start and stop
##
gcCount <- function(line, st, sp){
chars = strsplit(as.character(line),"")[[1]]
numGC = 0
for(j in st:sp){
##nested ifs faster than an OR (|) construction
if(chars[[j]] == "g"){
numGC <- numGC + 1
}else if(chars[[j]] == "G"){
numGC <- numGC + 1
}else if(chars[[j]] == "c"){
numGC <- numGC + 1
}else if(chars[[j]] == "C"){
numGC <- numGC + 1
}
}
return(numGC)
}
运行 Rprof 给我以下输出:
> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")
self.time self.pct total.time total.pct
"gcCount" 77.36 76.8 100.74 100.0
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.58 3.6 3.64 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$by.total
total.time total.pct self.time self.pct
"gcCount" 100.74 100.0 77.36 76.8
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.64 3.6 3.58 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$sampling.time
[1] 100.74
有什么建议可以让这段代码更快吗?