r - Extracting integers from ranges

Question

What is an efficient way to extract integers from ranges in R?

Suppose I have a matrix of ranges (column 1 = start, column 2 = end):

1   5
3   6
10  13

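I want to store the unique integers covered by all the ranges in the matrix in a single object.

This will be applied to a matrix containing about 4 million ranges, so I am hoping someone can offer an efficient solution.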

score 12 · Accepted Answer

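Suppose you have start = 3 and end = 7, and on a number line starting at 1 you mark the start position and the position just after the end (end + 1) with a '1':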

starts:     0 0 1 0 0 0 0 0 0 ...
ends + 1:   0 0 0 0 0 0 0 1 0 ...

The cumulative sum of the starts, the cumulative sum of the ends, and the difference between the two are:

cumsum(starts):   0 0 1 1 1 1 1 1 1 ...
cumsum(ends + 1): 0 0 0 0 0 0 0 1 1 ...
diff:             0 0 1 1 1 1 1 0 0

The positions of the 1s in the difference are:

which(diff > 0): 3 4 5 6 7
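The picture above can be reproduced in a few lines of base R. This is only a sketch for the single range start = 3, end = 7; the number-line length n = 9 and the variable names are chosen for illustration.

```r
n <- 9L                      # length of the number line (assumed for illustration)

starts <- integer(n)
starts[3] <- 1L              # mark the start position
ends <- integer(n)
ends[7 + 1] <- 1L            # mark the position just after the end

d <- cumsum(starts) - cumsum(ends)
d                            # 0 0 1 1 1 1 1 0 0
which(d > 0)                 # 3 4 5 6 7
```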

Using tabulate allows for multiple starts or ends at the same position, which gives:

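range2 <- function(ranges)
{
    n <- max(ranges)                           # largest position on the number line
    starts <- tabulate(ranges[,1], n)          # number of ranges starting at each position
    ends <- tabulate(ranges[,2] + 1L, n)       # number of ranges ending just before each position
                                               # (values above n are ignored by tabulate)
    which(cumsum(starts) - cumsum(ends) > 0L)  # positions covered by at least one range
}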

For the example in the question, this gives:

> eg <- matrix(c(1, 3, 10, 5, 6, 13), 3)
> range2(eg)
 [1]  1  2  3  4  5  6 10 11 12 13
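To see why tabulate matters, here is a small sketch with several ranges that start or end at the same position. The matrix dup is a made-up input, and range2 is repeated (with the local variable named n) so that the block runs on its own.

```r
range2 <- function(ranges) {
    n <- max(ranges)
    starts <- tabulate(ranges[, 1], n)
    ends <- tabulate(ranges[, 2] + 1L, n)
    which(cumsum(starts) - cumsum(ends) > 0L)
}

# Hypothetical input: two ranges start at 1, two end at 5, and 8..9 appears twice
dup <- matrix(c(1, 5,
                1, 3,
                4, 5,
                8, 9,
                8, 9), ncol = 2, byrow = TRUE)

tabulate(dup[, 1], max(dup))   # counts of starts per position: 2 0 0 1 0 0 0 2 0
range2(dup)                    # 1 2 3 4 5 8 9
```

Because the starts and ends are counted rather than just flagged, the running difference is the number of ranges covering each position, so duplicated or nested ranges do not end coverage too early.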

On Andrie's example (the 1-million-row matrix xx built in the sequence/rep answer), it is quite fast:

 > system.time(runs <- range2(xx))
   user  system elapsed 
  0.108   0.000   0.111

(This sounds a bit like DNA sequence analysis, for which GenomicRanges might be your friend; you would use the coverage and slice functions on reads, perhaps input with readGappedAlignments.)

score 5 · Accepted Answer

I don't know whether it is particularly efficient, but if your matrix of ranges is called ranges, then the following should work:

unique(unlist(apply(ranges, 1, function(x) x[1]:x[2])))
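One caveat that seems worth checking (an observation about base R, not something stated in the original answer): when every range has the same length, apply() returns a matrix rather than a list, unlist() leaves that matrix as it is, and unique() then de-duplicates rows instead of individual integers. A sketch that sidesteps this, assuming the same two-column ranges matrix, builds the sequences with Map so the intermediate result is always a list. The helper name range_ints is hypothetical.

```r
range_ints <- function(ranges) {
  # Map always returns a list: one integer sequence per row of the matrix
  seqs <- Map(seq.int, ranges[, 1], ranges[, 2])
  unique(unlist(seqs, use.names = FALSE))
}

eg <- matrix(c(1, 3, 10, 5, 6, 13), 3)
range_ints(eg)                               # 1 2 3 4 5 6 10 11 12 13

same_len <- matrix(c(1, 2, 3, 4), ncol = 2)  # ranges 1..3 and 2..4, equal length
range_ints(same_len)                         # 1 2 3 4
```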

score 5 · Accepted Answer

Using sequence and rep:

x <- matrix(c(1, 5, 3, 6, 10, 13), ncol=2, byrow=TRUE)

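ranges <- function(x){
  len <- x[, 2] - x[, 1] + 1   # number of integers in each range
  #allocate space (both vectors are reassigned below)
  a <- b <- vector("numeric", sum(len))
  a <- rep(x[, 1], len)        # start of each range, repeated once per element
  b <- sequence(len)-1         # offsets 0, 1, ..., len - 1 within each range
  unique(a+b)                  # start + offset, with duplicates removed
}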

ranges(x)
[1]  1  2  3  4  5  6 10 11 12 13

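Since this uses only vectorised code, it should be quite fast even for large data sets. On my machine, an input matrix with 1 million rows takes about 5 seconds to run: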

set.seed(1)
xx <- sample(1e6, 1e6)
xx <- matrix(c(xx, xx+sample(1:100, 1e6, replace=TRUE)), ncol=2)
str(xx)
 int [1:1000000, 1:2] 265509 372124 572853 908206 201682 898386 944670 660794 629110 61786 ...

system.time(zz <- ranges(xx))
user  system elapsed 
   4.33    0.78    5.22 

str(zz)
num [1:51470518] 265509 265510 265511 265512 265513 ...
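Note that the str output above starts at 265509: unique() keeps values in first-seen order, whereas the cumulative-sum approach returns them sorted. The sketch below checks on a scaled-down, made-up input (the matrix small) that the two approaches agree once the order is normalised. Both functions are repeated in compact form so the block is self-contained.

```r
range2 <- function(ranges) {
    n <- max(ranges)
    starts <- tabulate(ranges[, 1], n)
    ends <- tabulate(ranges[, 2] + 1L, n)
    which(cumsum(starts) - cumsum(ends) > 0L)
}

ranges <- function(x) {
  len <- x[, 2] - x[, 1] + 1
  unique(rep(x[, 1], len) + sequence(len) - 1)
}

# Smaller version of the benchmark input, for a quick consistency check
set.seed(1)
small <- sample(1e4, 1e3)
small <- matrix(c(small, small + sample(1:100, 1e3, replace = TRUE)), ncol = 2)

a <- range2(small)         # already in increasing order
b <- sort(ranges(small))   # first-seen order, so sort before comparing
all.equal(as.numeric(a), as.numeric(b))
```

If the order of the result matters downstream, sorting the output of ranges adds extra work that the cumulative-sum version does not need.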

score 3 · Accepted Answer

Isn't it as simple as:

x <- matrix(c(1, 5, 3, 6, 10, 13), ncol=2, byrow=TRUE)
do.call(":",as.list(range(x)))
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13

Edit

It looks like I misread the question, but my answer can be modified to use union, although that is just a wrapper around unique:

Reduce("union",apply(x,1,function(y) do.call(":",as.list(y))))
[1]  1  2  3  4  5  6 10 11 12 13
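The difference between the two attempts can be checked directly: range(x) spans from the overall minimum to the overall maximum, so it also includes the integers that fall in the gaps between ranges. A small sketch, using lapply over the rows instead of apply (a substitution of mine, so that the intermediate result is always a list):

```r
x <- matrix(c(1, 5, 3, 6, 10, 13), ncol = 2, byrow = TRUE)

span <- do.call(":", as.list(range(x)))   # 1:13, the full span
covered <- Reduce(union,
                  lapply(seq_len(nrow(x)), function(i) x[i, 1]:x[i, 2]))

setdiff(span, covered)                    # 7 8 9, the gap that is not covered
```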
