r - Find all ranges outside a defined set of ranges

Question

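What is the best way to find all the ranges that are not covered by a given set of ranges? For example, suppose I have a set of genes with known coordinates:

library(data.table)
dtGenes <- fread(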
  "id,start,end
 1,1000,1300
 2,1200,1500
 3,1600,2600
 4,3000,4000
")

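Suppose I know that the total length of the chromosome is 10000 (for simplicity, assume all the genes are on the same chromosome). In the end, I would like to have the following list of intergenic regions: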

"startR,endR
    0,1000
 1500,1600
 2600,3000
 4000,10000
"

Would the Bioconductor IRanges package be useful here? Or is there another good way to solve this?

score 4 · Accepted Answer

Use the Bioconductor package GenomicRanges and convert your original data to a GRanges object:

library(GenomicRanges)
gr <- with(dtGenes, GRanges("chr1", IRanges(start, end, names=id),
                            seqlengths=c(chr1=10000)))

Then find the gaps between your genes:

gaps <- gaps(gr)

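GRanges is strand-aware. You did not specify a strand in the GRanges constructor, so every range was assigned the strand *. As a result, gaps() reports "gaps" on the +, - and * strands, and you are only interested in those on the * strand: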

> gaps[strand(gaps) == "*"]
GRanges with 4 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 [   1,   999]      *
  [2]     chr1 [1501,  1599]      *
  [3]     chr1 [2601,  2999]      *
  [4]     chr1 [4001, 10000]      *
  ---
  seqlengths:
    chr1
   10000

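Note the Bioconductor conventions: chromosomes start at 1, and ranges are closed, meaning that both the start and end coordinates are included in the range. Use shift and narrow on gr to make your ranges consistent with the Bioconductor conventions. GRanges operations are efficient even on millions of ranges.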
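The coordinate conversion can be sketched as follows. This is only an illustration and assumes the coordinates in the question are 0-based and half-open (start excluded from the 1-based view, end included), which the question does not state explicitly. Under that assumption, adding 1 to each start before building the GRanges, and subtracting 1 from each gap start afterward, maps between the two conventions. The names genes and intergenic are made up for this example, and the adjustment is done by plain arithmetic rather than with shift or narrow.

```r
library(GenomicRanges)

# Gene table from the question.
# ASSUMPTION: coordinates are 0-based, half-open intervals [start, end).
genes <- data.frame(id    = 1:4,
                    start = c(1000, 1200, 1600, 3000),
                    end   = c(1300, 1500, 2600, 4000))

# Convert to Bioconductor's 1-based, closed convention: [start + 1, end].
gr <- GRanges("chr1",
              IRanges(genes$start + 1, genes$end),
              seqlengths = c(chr1 = 10000))

# Gaps are reported per strand; keep only the unstranded ("*") ones.
g <- gaps(gr)
g <- g[strand(g) == "*"]

# Convert the gaps back to the 0-based, half-open convention.
intergenic <- data.frame(startR = start(g) - 1L, endR = end(g))
intergenic
```

If the assumption about the input convention holds, intergenic should contain the four intervals listed in the question (0 to 1000, 1500 to 1600, 2600 to 3000, 4000 to 10000).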

score 1 · Accepted Answer

You can use reduce from the IRanges package:

reduce first orders the ranges in x from left to right, then merges the overlapping or adjacent ones.

library(IRanges)
dat <- read.csv(text="id,start,end
 1,1000,1300
 2,1200,1500
 3,1600,2600
 4,3000,4000
")

ir <- IRanges(dat$start,dat$end)
rir <- reduce(ir)
rir
# IRanges of length 3
#     start  end width
# [1]  1000 1500   501
# [2]  1600 2600  1001
# [3]  3000 4000  1001
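Merging the ranges is only the first half of the task; the uncovered stretches between the merged ranges still have to be extracted. A minimal base-R sketch of both steps is shown below. It does not use the IRanges API, and find_gaps is a hypothetical helper written for this illustration. It assumes the question's convention, in which a gap runs from the end of one covered stretch to the start of the next and the chromosome spans 0 to chrom_length.

```r
# find_gaps is a hypothetical helper, not part of IRanges or GenomicRanges.
# ASSUMPTION: intervals are half-open, so a range ending at 1500 and a
# range starting at 1500 leave no gap between them.
find_gaps <- function(start, end, chrom_length) {
  # Sort ranges from left to right.
  ord <- order(start, end)
  s <- start[ord]
  e <- end[ord]

  # Furthest position covered before each range (0 before the first one),
  # followed by the furthest position covered by all ranges.
  covered_before <- c(0, cummax(e))

  # Start of each range, followed by the chromosome end as a sentinel.
  next_start <- c(s, chrom_length)

  # A gap exists wherever the next range starts beyond the covered region.
  keep <- next_start > covered_before
  data.frame(startR = covered_before[keep], endR = next_start[keep])
}

dat <- data.frame(id    = 1:4,
                  start = c(1000, 1200, 1600, 3000),
                  end   = c(1300, 1500, 2600, 4000))

res <- find_gaps(dat$start, dat$end, chrom_length = 10000)
res
```

The running maximum from cummax plays the role of the merge step: overlapping ranges such as genes 1 and 2 never open a gap because the second starts before the first has ended.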
