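When using R I always keep in mind: "avoid loops whenever possible". However, I am stuck right now and have not been able to find a neat way to write the code I need.

For the record, after several comments: my statement above is not correct. There is no need to avoid loops here to gain efficiency.

I have two character vectors as input, let's call them a and b. They can only contain the letters "M", "I" and "D".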

a = c("M","I","D","D","M","M","M","M","M","M")
b = c("M","M","M","M","M","M","D","M","M")

My desired output is:

d = c("M","I","D","D","M","M","M","M","I","M","M")

The following function gives me that output:

my.function <- function(a, b)
{
  nrow.df = length(a) + length(which(b=="D"))
  my.df = data.frame(a = rep(NA, nrow.df),  
                      b = rep(NA, nrow.df), 
                      d = rep(NA, nrow.df))
  my.df$a[1:length(a)] = a
  my.df$b[1:length(b)] = b
  for (i in 1:nrow.df)
  {
    if(my.df$a[i] == "D") {
      my.df$d[i] = "D"
      my.df$b[(i+1):nrow.df] = my.df$b[i:(nrow.df-1)]
    } else if (my.df$b[i] == "D") {
      my.df$d[i] = "I"
      my.df$a[(i+1):nrow.df] = my.df$a[i:(nrow.df-1)]
    } else if (my.df$a[i] == "I") {
      my.df$d[i] = "I"
    } else if (my.df$b[i] == "I") {
      my.df$d[i] = "D"
    } else {
      my.df$d[i] = my.df$a[i]
    }
  }
  return(my.df$d)
}

> d = my.function(a,b)
> d
 [1] "M" "I" "D" "D" "M" "M" "M" "M" "I" "M" "M"

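The logic of the function is as follows: whenever there is a "D" in a, put a "D" in d and shift the vector b by 1. Conversely, whenever there is a "D" in b, put an "I" in d and shift a by 1.

Next, when there is an "I" in a but no "D" in b, put an "I" in d. Conversely, whenever there is an "I" in b and no "D" in a, put a "D" in d. Otherwise, d = a.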
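
The rules above amount to one decision per position. As a minimal sketch (the helper name `classify` and its return format are made up for illustration; they are not part of the original function), the branch order of `my.function` could be written as:

```r
# Hypothetical helper: given the current elements x (from a) and y (from b),
# return the letter that goes into d and which vector, if any, gets shifted.
classify <- function(x, y) {
  if (x == "D") return(list(d = "D", shift = "b"))     # "D" in a: b is shifted
  if (y == "D") return(list(d = "I", shift = "a"))     # "D" in b: a is shifted
  if (x == "I") return(list(d = "I", shift = "none"))
  if (y == "I") return(list(d = "D", shift = "none"))
  list(d = x, shift = "none")                          # otherwise copy a
}

classify("M", "D")$d
```

The order of the checks matters: a "D" in a takes precedence over anything in b, exactly as in the if/else chain of the function.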

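It is not a complicated function, but I am struggling to make it efficient in R. I apply this function millions of times with mclapply, so a fast implementation would save me a lot of time.

Would you recommend using Rcpp? Would it be much faster? Is there any slowdown from R communicating with C++ millions of times, or is that handled automatically by Rcpp?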

4 Answers

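I don't see any easy way to avoid a loop here. However, there is a more efficient way to do this. The problem is that you are actually shifting a and b every time you encounter the character "D". Shifting a vector like this is an O(n) operation, so the running time of this loop is actually O(n^2).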
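
To illustrate why the shift can be avoided, here is a small sketch on a toy vector (the values and variable names are made up for illustration). Reading from the unshifted vector with an offset gives the same elements as physically shifting the tail, without copying anything:

```r
# Toy vector with distinct values so positions are easy to follow
x <- c("p", "q", "r", "s", "t", NA)
n <- length(x)
i <- 3                          # position at which a shift would happen

# Approach 1: shift the tail one slot to the right (copies n - i elements)
shifted <- x
shifted[(i + 1):n] <- shifted[i:(n - 1)]

# Approach 2: leave x untouched and remember that one position was skipped
skipped <- 1
later <- (i + 1):n              # every position after the shift point

# Both approaches read the same elements
same <- identical(shifted[later], x[later - skipped])
same
```

Keeping a running offset costs constant time per "D", whereas each shift copies the rest of the vector.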

You can simplify the code and get better performance as follows:

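f <- function(a, b) {
  # Output positions filled so far without consuming an element of a (or b)
  aSkipped <- 0
  bSkipped <- 0
  # Each "D" in b adds one extra position to the output
  d <- character(length(a) + sum(b == "D"))

  for (i in seq_along(d)) {
    if (a[i - aSkipped] == "D") {
      d[i] <- "D"
      bSkipped <- bSkipped + 1   # b is not consumed at this position
    } else if (b[i - bSkipped] == "D") {
      d[i] <- "I"
      aSkipped <- aSkipped + 1   # a is not consumed at this position
    } else if (a[i - aSkipped] == "I") {
      d[i] <- "I"
    } else if (b[i - bSkipped] == "I") {
      d[i] <- "D"
    } else {
      d[i] <- a[i - aSkipped]
    }
  }
  d
}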

Edit: the performance gain really shows once the inputs get large. For short vectors without many "D"s, this runs in about the same time as Ananda Mahto's solution:

> set.seed(123)
> a<-c(sample(c("M","I"),500,T))
> b<-c(sample(c("M","I"),500,T))
> a[sample(500,50)]<-"D"
> b[sample(500,50)]<-"D"
> microbenchmark(f(a,b),my.function.v(a,b))
Unit: milliseconds
                expr      min       lq   median       uq      max neval
             f(a, b) 4.259970 4.324046 4.368018 4.463925 9.694951   100
 my.function.v(a, b) 4.442873 4.497172 4.533196 4.639543 9.901044   100

But for vectors of length 50000 with 5000 "D"s, the difference is large:

> set.seed(123)
> a<-c(sample(c("M","I"),50000,T))
> b<-c(sample(c("M","I"),50000,T))
> a[sample(50000,5000)]<-"D"
> b[sample(50000,5000)]<-"D"
> system.time(f(a,b))
   user  system elapsed 
  0.460   0.000   0.463 
> system.time(my.function.v(a,b))
   user  system elapsed 
  7.056   0.008   7.077 
answered 2013-10-02T18:16:40.443

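As per my comment, if speed is a concern, the first step is to not use data.frames unnecessarily. This answer does not address the loop (as others have already said, there is nothing wrong with using loops in R if they are done correctly).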
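
As a rough way to see the effect for yourself, the sketch below fills a data.frame column and a plain vector element by element and times both. The size `n` and the variable names are arbitrary choices for illustration, and the timings will vary by machine and R version, so none are quoted here:

```r
n <- 1000

# Same storage, once as a data.frame column and once as a plain vector
df <- data.frame(d = rep(NA_character_, n), stringsAsFactors = FALSE)
v  <- rep(NA_character_, n)

# Element-wise assignment, as done inside the loop of the original function
t_df <- system.time(for (i in seq_len(n)) df$d[i] <- "M")
t_v  <- system.time(for (i in seq_len(n)) v[i] <- "M")

# Both end up with the same contents; compare t_df and t_v to see the cost
identical(df$d, v)
```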

Here is a very slightly modified version of your function that uses vectors instead of data.frames to store the data.

my.function.v <- function(a, b) {
  nrow.df = length(a) + length(which(b=="D"))
  A <- B <- D <- vector(length = nrow.df)
  A[1:length(a)] = a
  B[1:length(b)] = b
  for (i in 1:nrow.df)
  {
    if(A[i] == "D") {
      D[i] = "D"
      B[(i+1):nrow.df] = B[i:(nrow.df-1)]
    } else if (B[i] == "D") {
      D[i] = "I"
      A[(i+1):nrow.df] = A[i:(nrow.df-1)]
    } else if (A[i] == "I") {
      D[i] = "I"
    } else if (B[i] == "I") {
      D[i] = "D"
    } else {
      D[i] = A[i]
    }
  }
  return(D)
}

Note the relative difference in speed below:

library(microbenchmark)
microbenchmark(my.function(a, b), my.function.v(a, b), f(a, b))
# Unit: microseconds
#                 expr      min        lq    median        uq      max neval
#    my.function(a, b) 1448.416 1490.8780 1511.3435 1547.3880 6674.332   100
#  my.function.v(a, b)  157.248  165.8725  171.6475  179.1865  324.722   100
#              f(a, b)  168.874  177.5455  184.8775  193.3455  416.551   100

As you can see, @mrip's function also performs much better than your original function.

answered 2013-10-02T18:16:43.990

OK, here is an Rcpp solution. As expected, it beats the R solutions by a wide margin:

rcppFun<-"
CharacterVector fcpp(CharacterVector a,CharacterVector b,int size){
int aSkipped = 0;
int bSkipped = 0;
int asize = a.size();
Rcpp::CharacterVector d(size);
for(int i=0; i<size; i++){
    if(i-aSkipped<asize && a[i-aSkipped][0] == 'D') {
      d[i] = \"D\";
      bSkipped++;
    } else if (b[i-bSkipped][0] == 'D') {
      d[i] = \"I\";
      aSkipped++;
    } else if (a[i-aSkipped][0] == 'I') {
      d[i] = \"I\";
    } else if (b[i-bSkipped][0] == 'I') {
      d[i] = \"D\";
    } else {
      d[i] = a[i-aSkipped];
    }
}
 return d;
}"
require("Rcpp")
fcpp<-cppFunction(rcppFun)

f3<-function(a,b){
  fcpp(a,b,as.integer(length(a)+sum(b=="D")))
}

Warning: the function does no argument checking at all, so if you feed it bad data you can easily get a segfault.
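
One way to reduce that risk is to validate the inputs in R before handing them to the compiled code. The following is only a sketch: `check_inputs` and `f3_checked` are hypothetical names, and these checks cover type, missing values and the allowed letters only. They do not guarantee that every index computed inside `fcpp` stays in range.

```r
# Hypothetical helper: stop early on inputs the compiled code cannot handle
check_inputs <- function(a, b) {
  for (x in list(a, b)) {
    if (!is.character(x)) stop("inputs must be character vectors")
    if (anyNA(x)) stop("inputs must not contain NA")
    if (!all(x %in% c("M", "I", "D"))) {
      stop("inputs may only contain \"M\", \"I\" and \"D\"")
    }
  }
  invisible(TRUE)
}

# Usage sketch, assuming fcpp has been compiled as shown above:
# f3_checked <- function(a, b) {
#   check_inputs(a, b)
#   fcpp(a, b, as.integer(length(a) + sum(b == "D")))
# }
```

The checks add some overhead per call, so whether they are worth it depends on how much you trust the data going in.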

If you are going to call this a lot, Rcpp is definitely the way to go:

> with(ab(10),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr     min       lq   median       uq     max neval
             f(a, b) 103.993 107.5155 108.6815 109.7455 178.801   100
            f3(a, b)   7.354   8.1305   8.5575   9.1220  18.014   100
            f2(a, b)  87.081  90.4150  92.2730  94.2585 146.502   100
 my.function.v(a, b)  84.389  86.5140  87.6090  88.8340 109.106   100
> with(ab(100),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr     min        lq    median        uq      max neval
             f(a, b) 992.082 1018.9850 1032.0180 1071.0690 2784.710   100
            f3(a, b)  12.873   14.3605   14.7370   15.5095   35.582   100
            f2(a, b) 119.396  125.4405  129.3015  134.9915 1909.930   100
 my.function.v(a, b) 769.618  786.7865  802.2920  824.0820  905.737   100
> with(ab(1000),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
                expr      min        lq     median        uq       max neval
             f(a, b) 9816.295 10065.065 10233.1350 10392.696 12383.373   100
            f3(a, b)   66.057    67.869    83.9075    87.231  1167.086   100
            f2(a, b) 1637.972  1760.258  2667.6985  3138.229 47610.317   100
 my.function.v(a, b) 9692.885 10272.425 10997.2595 11402.602 54315.922   100
> with(ab(10000),microbenchmark(f(a,b),f3(a,b),f2(a,b)))
Unit: microseconds
     expr        min         lq      median          uq        max neval
  f(a, b) 101644.922 103311.678 105185.5955 108342.4960 144620.777   100
 f3(a, b)    607.702    610.039    669.8515    678.1845    785.415   100
 f2(a, b) 221305.641 247952.345 254478.1580 341195.5510 656408.378   100
> 
answered 2013-10-02T20:45:13.437

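Just to show how it could be done: this can be written in R without a loop, and here is one way. It is faster when the length is around 1000 or less, but slower for larger inputs. One takeaway is that you could definitely speed this up with Rcpp.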

f2 <- function(a,b) {
  da <- which(a=="D")
  db <- which(b=="D")
  dif <- outer(da, db, `<`) 
  da <- da + rowSums(!dif)
  db <- db + colSums(dif)
  ia <- which(a=="I")  
  ia <- ia + colSums(outer(db, ia, `<`))
  ib <- which(b=="I")
  ib <- ib + colSums(outer(da, ib, `<`))
  out <- rep("M", length(a) + length(db))
  out[da] <- "D"
  out[db] <- "I"
  out[ia] <- "I"
  out[ib] <- "D"
  out
}
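
To see what the `outer` calls in `f2` compute, here is a small sketch that runs only the first few lines on the example vectors from the question. The names `da_out` and `db_out` are introduced here for illustration so the original positions stay visible:

```r
# Example vectors from the question
a <- c("M","I","D","D","M","M","M","M","M","M")
b <- c("M","M","M","M","M","M","D","M","M")

da <- which(a == "D")            # 3 4
db <- which(b == "D")            # 7

# dif[i, j] is TRUE when the i-th "D" of a has a smaller index than the j-th "D" of b
dif <- outer(da, db, `<`)

# Shift each position by the number of "D"s in the other vector that precede it
da_out <- da + rowSums(!dif)     # 3 4
db_out <- db + colSums(dif)      # 9
```

Positions 3, 4 and 9 are where the expected output d has "D", "D" and "I". Note that `outer` builds a matrix with one cell per pair of "D"s, so its memory and time grow with the product of the two counts.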

To generate the data:

ab <- function(N) {
  set.seed(123)
  a<-c(sample(c("M","I"),N,TRUE))
  b<-c(sample(c("M","I"),N,TRUE))
  a[sample(N,N/10)]<-"D"
  b[sample(N,N/10)]<-"D"
  list(a=a,b=b)
}

Timings:

> library(microbenchmark)
> with(ab(10), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
                expr    min       lq   median       uq     max neval
 my.function.v(a, b) 79.102  86.9005  89.3680  93.2410 279.761   100
             f(a, b) 84.334  91.1055  94.1790  98.2645 215.579   100
            f2(a, b) 94.807 101.5405 105.1625 108.9745 226.149   100

> with(ab(100), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
                expr     min       lq  median       uq      max neval
 my.function.v(a, b) 732.849 750.4480 762.906 845.0835 1953.371   100
             f(a, b) 789.380 805.8905 819.022 902.5865 1921.064   100
            f2(a, b) 124.442 129.1450 134.543 137.5910  237.498   100

> with(ab(1000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: milliseconds
                expr       min        lq    median        uq      max neval
 my.function.v(a, b) 10.146865 10.387144 10.695895 11.123164 13.08263   100
             f(a, b)  7.776286  7.973918  8.266882  8.633563  9.98204   100
            f2(a, b)  1.322295  1.355601  1.385302  1.465469  1.85349   100

> with(ab(10000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b), times=10))
Unit: milliseconds
                expr      min        lq    median        uq       max neval
 my.function.v(a, b) 429.4030 435.00373 439.06706 442.51650 465.00124    10
             f(a, b)  80.7709  83.71715  85.14887  88.02067  89.00047    10
            f2(a, b) 164.7807 170.37608 175.94281 247.78353 251.14653    10
answered 2013-10-02T19:25:01.047