Solution:

- uses paste() to collapse the vector elements together
- uses fread() to parse the collapsed string into a data.table/data.frame

As a function:
collapse2fread <- function(x, sep) {
  require(data.table)
  # collapse the vector into one newline-delimited string, then let fread() parse it
  fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}
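For example, with a small made-up input (the underscore-separated strings are just an illustration):

x <- c("a_b_c", "d_e_f")
collapse2fread(x, sep = "_")
# returns a 2-row data.table with columns V1, V2, V3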
Rcpp on top of that? We could also try doing the collapse in C++ via the Rcpp package to squeeze a bit more out of it. Something like:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {
  int n = subject.size();
  std::string collapsed;
  // append each element followed by the separator
  // (the trailing separator is harmless: fread() ignores a final newline)
  for (int i = 0; i < n; i++) {
    collapsed += std::string(subject[i]) + collapseBy;
  }
  return collapsed;
}
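To compile it from R, one option (a sketch; the file name is just a placeholder) is to save the snippet above as collapse_cpp.cpp and source it:

library(Rcpp)
sourceCpp("collapse_cpp.cpp")  # compiles and exposes collapse_cpp() in R
collapse_cpp(c("a", "b"), "\n")
# [1] "a\nb\n"   <- note the trailing separator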
Then we get:
collapse_cpp2fread <- function(x, sep) {
  require(data.table)
  fread(collapse_cpp(x, "\n"), sep = sep, header = FALSE)
}
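A quick sanity check on a made-up input that both wrappers parse to the same table:

x <- c("a_b_c", "d_e_f")
all.equal(collapse2fread(x, "_"), collapse_cpp2fread(x, "_"))
# [1] TRUE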
A quick test of the C++ function:

library(microbenchmark)
# `words` is the small example vector from the question
microbenchmark(
  paste0(words, collapse = "\n"),
  collapse_cpp(words, "\n"),
  times = 100
)
Not much, but it's something:

> Unit: microseconds
>                            expr   min     lq median     uq    max neval
>  paste0(words, collapse = "\n") 7.297 7.7695  8.162 8.4255 33.824   100
>       collapse_cpp(words, "\n") 4.477 5.0095  5.117 5.3525 17.052   100
Comparison to the strsplit() method. First, make a more realistic input:

words <- rep(paste0(letters[1:8], collapse = '_'), 1e5)  # 100K elements
The benchmark:

microbenchmark(
  do.call(rbind, strsplit(words, '_')),
  fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE),
  fread(collapse_cpp(words, "\n"), sep = "_", header = FALSE),
  times = 10
)
gives:
> Unit: milliseconds
>                                                               expr       min        lq    median        uq      max neval
>                               do.call(rbind, strsplit(words, "_")) 782.71782 796.19154 822.73694 854.22211 863.0790    10
>  fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE)  62.56164  64.13504  68.22512  71.96075 151.5969    10
>       fread(collapse_cpp(words, "\n"), sep = "_", header = FALSE)  47.16362  47.78030  50.12867  52.23102 109.9770    10
So about a 16x improvement over the strsplit() approach at this size (median of ~823 ms vs. ~50 ms). Hope it helps!