Solution:

- uses paste() to collapse the vector elements together
- uses fread() to parse the collapsed string into a data.table/data.frame

As a function:
collapse2fread <- function(x, sep) {
  require(data.table)
  # collapse the vector into one newline-delimited string, then let fread() parse it
  fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}
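For example, with a small made-up input (the underscore-separated strings are just an illustration):

x <- c("a_b_c", "d_e_f")
collapse2fread(x, sep = "_")
# returns a 2-row data.table with columns V1, V2, V3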
Rcpp on top of that? We could also try doing the collapse in C++ via the Rcpp package to squeeze a bit more out of it. Something like:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {
  int n = subject.size();
  std::string collapsed;
  // append each element followed by the separator
  // (the trailing separator is harmless: fread() ignores a final newline)
  for (int i = 0; i < n; i++) {
    collapsed += std::string(subject[i]) + collapseBy;
  }
  return collapsed;
}
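To compile it from R, one option (a sketch; the file name is just a placeholder) is to save the snippet above as collapse_cpp.cpp and source it:

library(Rcpp)
sourceCpp("collapse_cpp.cpp")  # compiles and exposes collapse_cpp() in R
collapse_cpp(c("a", "b"), "\n")
# [1] "a\nb\n"   <- note the trailing separator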
Then we get:
collapse_cpp2fread <- function(x, sep) {
  require(data.table)
  fread(collapse_cpp(x, "\n"), sep = sep, header = FALSE)
}
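A quick sanity check on a made-up input that both wrappers parse to the same table:

x <- c("a_b_c", "d_e_f")
all.equal(collapse2fread(x, "_"), collapse_cpp2fread(x, "_"))
# [1] TRUE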
A quick test of the C++ function:

library(microbenchmark)
# `words` is the small example vector from the question
microbenchmark(
  paste0(words, collapse = "\n"),
  collapse_cpp(words, "\n"),
  times = 100
)
Not much, but it's something:

> Unit: microseconds
>                            expr   min     lq median     uq    max neval
>  paste0(words, collapse = "\n") 7.297 7.7695  8.162 8.4255 33.824   100
>       collapse_cpp(words, "\n") 4.477 5.0095  5.117 5.3525 17.052   100
Comparison to the strsplit() method. First, make a more realistic input:

words <- rep(paste0(letters[1:8], collapse = '_'), 1e5)  # 100K elements
The benchmark:

microbenchmark(
  do.call(rbind, strsplit(words, '_')),
  fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE),
  fread(collapse_cpp(words, "\n"), sep = "_", header = FALSE),
  times = 10
)
gives:
> Unit: milliseconds
>                                                               expr       min        lq    median        uq      max neval
>                               do.call(rbind, strsplit(words, "_")) 782.71782 796.19154 822.73694 854.22211 863.0790    10
>  fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE)  62.56164  64.13504  68.22512  71.96075 151.5969    10
>       fread(collapse_cpp(words, "\n"), sep = "_", header = FALSE)  47.16362  47.78030  50.12867  52.23102 109.9770    10
So about a 16x improvement over the strsplit() approach at this size (median of ~823 ms vs. ~50 ms). Hope it helps!