r - Efficient way to split a vector of cigars using mclapply

Question

I have a very large vector of cigars:

my.vector = c("44M2D1I","32M465N3M", "3S4I3D45N65M")

I'd like to transform it into a vector of split cigars. The logic is as follows: whenever a number is followed by an N, I split around it. That is why "32M465N3M" becomes "32M", "465N", "3M", and "3S4I3D45N65M" becomes "3S4I3D", "45N", "65M"; "44M2D1I" is not split because it contains no "N".

my.vector.split = c("44M2D1I", "32M", "465N", "3M", "3S4I3D", "45N", "65M")

My vector is very large so ideally I'd like to use the parallel capabilities of the cluster. I'd like to use mclapply with ncores.

Ideally, I'd like to define something like this:

 my.vector.split = unlist(mclapply(my.vector, my.splitting.function, mc.cores = ncores))

where the length of my.vector.split is length(my.vector) + (number of Ns)*2.

Note: the HPC cluster I am using does not have the latest Bioconductor installed, so I cannot use cigarToRleList or the other nice CIGAR manipulation tools.

score 1 · Accepted Answer

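This should work. The details will vary depending on how you set up your cluster, but basically read.table returns a data frame for each element. To get them as vectors, wrap unlist around the call, as done here: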

 lapply(gsub("([[:digit:]]+N)", ",\\1,", my.vector) , 
         function(x) unlist( read.table(text=x,sep=",",colClasses="character")) )
#------------
[[1]]
       V1 
"44M2D1I" 

[[2]]
    V1     V2     V3 
 "32M" "465N"   "3M" 

[[3]]
      V1       V2       V3 
"3S4I3D"    "45N"    "65M"

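To connect the answer back to the mclapply call in the question, here is a minimal sketch, not the answerer's own code. It reuses the same gsub pattern but replaces read.table with strsplit, and it gives each worker one chunk of the vector rather than one element, on the assumption that per-element dispatch overhead would otherwise dominate. The helper name split_cigars, the chunk.id and chunks variables, and the choice of two cores are hypothetical. mc.cores greater than 1 is not supported on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Hypothetical helper: vectorised over a character vector of CIGAR strings.
split_cigars <- function(cigars) {
  # Surround every "<digits>N" operation with commas ...
  marked <- gsub("([0-9]+N)", ",\\1,", cigars)
  # ... then split on the commas and flatten into one character vector.
  pieces <- unlist(strsplit(marked, ",", fixed = TRUE), use.names = FALSE)
  # Drop empty strings produced when an N operation starts a CIGAR
  # or when two N operations are adjacent.
  pieces[nzchar(pieces)]
}

my.vector <- c("44M2D1I", "32M465N3M", "3S4I3D45N65M")

# Assumption: two cores; forking is unavailable on Windows, so use one there.
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# Assign elements to contiguous chunks, at most one chunk per core.
chunk.id <- ceiling(seq_along(my.vector) * ncores / length(my.vector))
chunks <- split(my.vector, chunk.id)

my.vector.split <- unlist(mclapply(chunks, split_cigars, mc.cores = ncores),
                          use.names = FALSE)
print(my.vector.split)
```

Because gsub and strsplit are already vectorised, calling split_cigars(my.vector) directly may well be fast enough without any parallelism; it is worth timing that before reaching for mclapply.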