
I have a sorted vector of positive and negative numbers:

vec<-c(seq(-100,-1), rep(0,20), seq(1,100))

The real vector is larger than this example and takes on a random set of values. I have to repeatedly find the number of negative numbers in the vector, and I am finding this quite inefficient.

Since I only need to find the number of negative numbers, and the vector is sorted, I only need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).

Currently I am using this code to find the length

length(which(vec<0))

but this forces R to go through the entire vector, which should not be necessary because the vector is sorted.

I could use

match(0, vec)

but my vector does not always contain 0s.

So my question is, is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run my which() code?


3 Answers


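The solutions offered so far all create a logical(length(vec)) and then scan it, fully or partially. As you note, the vector is sorted, and we can exploit that with a binary search. I started out thinking I would be clever and implement this in C for even more speed, but I had trouble debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R: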

f3 <- function(x) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}
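Index arithmetic in a binary search is easy to get wrong, so a quick check on boundary inputs seems worthwhile. A minimal sketch, restating f3 from above so that it runs on its own; the test vectors are illustrative and not from the question:

```r
# f3 as defined above: binary search for the last negative element
f3 <- function(x) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L   # answer lies to the left of imid
        else
            imin <- imid + 1L   # answer lies at or to the right of imid
    }
    imax
}

# Boundary cases (inputs must be sorted and free of NA)
f3(c(-3, -2, -1))   # all negative: 3
f3(c(1, 2))         # no negatives: 0
f3(c(0, 0, 0))      # only zeros: 0
f3(numeric(0))      # empty vector: 0
f3(c(-1, 1))        # no zero between the signs: 1
```

An NA in the vector would make the `if` condition fail with an error, so this sketch assumes the input has already been cleaned.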

For comparison with the other suggestions:

f0 <- function(v) length(which(v < 0))
f1 <- function(v) sum(v < 0)
f2 <- function(v) which.min(v < 0) - 1L

And for fun:

library(compiler)
f3.c <- cmpfun(f3)

This gives:

> library(microbenchmark)
> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
      expr       min        lq     median         uq       max neval
   f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
   f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
   f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
   f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
 f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100

There are probably some tricky edge cases that I have got wrong! Moving to C, I did:

library(inline)
f4 <- cfunction(c(x = "numeric"), "
    int imin = 0, imax = Rf_length(x) - 1, imid;
    while (imax >= imin) {
        imid = imin + (imax - imin) / 2;
        if (REAL(x)[imid] >= 0)
            imax = imid - 1;
        else
            imin = imid + 1;
    }
    return ScalarInteger(imax + 1);
")

> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
      expr   min      lq  median      uq   max neval
   f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
 f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
   f4(vec)   553   796.0   893.5  1004.5  2908   100

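findInterval came up when a similar question was asked on the R-help list. It is slow but safe: it checks that vec is actually sorted and deals with NA values. If one wants to live on the edge (arguably no worse than implementing f3 or f4), then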

f5.i <- function(v)
    .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))

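is nearly as fast as the C implementation, but likely more robust, and it is vectorized (i.e., it can look up a vector of values in the second argument, which makes range-like calculations easy).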
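For reference, here is a sketch of the same count using the documented findInterval() rather than .Internal(), whose argument list is not part of R's public API and can differ between R versions. It assumes a version of R in which findInterval() has the left.open argument (added, as far as I know, in R 3.3.0); with left.open = TRUE the result is the number of elements strictly less than the query value. The name f5 and the query values are illustrative.

```r
# Count the elements of a sorted vector that are strictly below 0.
# f5 is a hypothetical name; left.open = TRUE makes intervals (v[i], v[i+1]].
f5 <- function(v) findInterval(0, v, left.open = TRUE)

vec <- c(seq(-100, -1), rep(0, 20), seq(1, 100))
f5(vec)                                              # 100

# Several thresholds at once: number of elements below -50, 0 and 50
findInterval(c(-50, 0, 50), vec, left.open = TRUE)   # 50 100 169
```

Unlike the .Internal() call, this keeps the sortedness and NA checks, so it trades some speed for safety.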

answered 2013-04-25T20:42:06.397

Use sum() and a logical comparison:

sum( vec < 0 )
[1] 100

This will be pretty quick. When you sum a logical vector, TRUE counts as 1 and FALSE as 0, so the total is the number of negative values.
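The coercion is easy to see on a tiny vector. A minimal illustration with made-up values:

```r
x <- c(-2, -1, 0, 3)   # illustrative values, not from the question
is_neg <- x < 0        # logical vector: TRUE TRUE FALSE FALSE
as.integer(is_neg)     # 1 1 0 0
sum(is_neg)            # 2
```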

Oh, I felt the need for a benchmark comparison... :-) The vector length is about 2e5:

library(microbenchmark)
vec<-c(seq(-100,-1,length.out=1e5), rep(0,20), seq(1,100,length.out=1e5))
microbenchmark( (which.min(vec < 0) - 1L) , (sum( vec < 0 )) )

Unit: milliseconds
                      expr      min       lq   median       uq       max neval
 (which.min(vec < 0) - 1L) 1.883847 2.130746 2.554725 3.141787 75.943911   100
            (sum(vec < 0)) 1.398100 1.500639 1.508688 1.745088  2.662164   100
answered 2013-04-25T11:06:20.130

You can use which.min:

 which.min(vec < 0) - 1L

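which.min returns the index of the first FALSE value, i.e. the first non-negative element (the first 0 in the example), so subtracting 1 gives the number of negative values.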
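One caveat: if every element is negative there is no FALSE to find, and which.min falls back to index 1, so the expression yields 0 instead of the vector length. A hedged sketch of a guarded version, assuming the vector is sorted and contains no NA; the name count_neg is made up for illustration:

```r
# count_neg is a hypothetical helper, not part of the original answer.
count_neg <- function(v) {
    n <- length(v)
    if (n == 0L) return(0L)       # empty input: nothing to count
    if (v[n] < 0) return(n)       # sorted, so last < 0 means all are negative
    which.min(v < 0) - 1L         # index of first non-negative element, minus 1
}

which.min(c(-3, -2, -1) < 0) - 1L   # 0, which is wrong
count_neg(c(-3, -2, -1))            # 3
count_neg(c(-1, 0, 2))              # 1
```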

answered 2013-04-25T11:15:28.680