r - How to get distinct/unique rows from a GenomicRanges object

Question

I created the following GenomicRanges object with this code:

library(GenomicRanges)
gr <- GRanges(seqnames = "chr1", strand = c("+", "-","-", "+"),ranges = IRanges(start = c(1,3,3,5), width = 3))
gr

It looks like this:

GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-3      +
  [2]     chr1       3-5      -
  [3]     chr1       3-5      -
  [4]     chr1       5-7      +

What I want is to get the unique rows from it, producing this (hand-coded):

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-3      +
  [2]     chr1       3-5      -
  [3]     chr1       5-7      +

How can I do this? In practice, I have about 9 million rows to process.

I can use the following approach, but it is slow:

 library(tidyverse)
 gr %>% 
   as_tibble() %>% 
   distinct()

score 1 · Accepted Answer

You can use unique to return the unique rows:

library(GenomicRanges)

gr <- GRanges(seqnames = "chr1", strand = c("+", "-","-", "+"),ranges = IRanges(start = c(1,3,3,5), width = 3))
unique(gr)
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       1-3      +
#>   [2]     chr1       3-5      -
#>   [3]     chr1       5-7      +
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
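
For context, unique() on a GRanges object is closely related to duplicated(), which returns a logical vector marking each range that repeats an earlier one. The sketch below reuses the gr object from the question; the variable names dup and gr_unique are illustrative and not part of the original answer.

```r
library(GenomicRanges)

gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-", "-", "+"),
              ranges = IRanges(start = c(1, 3, 3, 5), width = 3))

# TRUE marks a range that repeats an earlier one
# (same seqname, start, end and strand)
dup <- duplicated(gr)
dup
#> [1] FALSE FALSE  TRUE FALSE

# Dropping the flagged ranges keeps the first occurrence of each
gr_unique <- gr[!dup]
length(gr_unique)
#> [1] 3
```

Indexing with !duplicated(gr) can be handy when the same logical vector is also needed to subset another object that is aligned with gr.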

If you are converting the object to a data.frame anyway (as in your tidyverse solution), data.table's unique may be faster:

library(data.table)

unique(as.data.table(gr))
#>    seqnames start end width strand
#> 1:     chr1     1   3     3      +
#> 2:     chr1     3   5     3      -
#> 3:     chr1     5   7     3      +
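
If a GRanges object is needed again after deduplicating with data.table, the table can be converted back. This is a minimal sketch, assuming the column names produced by as.data.table(gr) as shown in the output above; dt and gr_back are illustrative names.

```r
library(GenomicRanges)
library(data.table)

gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-", "-", "+"),
              ranges = IRanges(start = c(1, 3, 3, 5), width = 3))

# Deduplicate on all columns (seqnames, start, end, width, strand)
dt <- unique(as.data.table(gr))

# Rebuild a GRanges object from the deduplicated columns
gr_back <- GRanges(seqnames = dt$seqnames,
                   ranges = IRanges(start = dt$start, end = dt$end),
                   strand = dt$strand)
length(gr_back)
#> [1] 3
```

The round trip through a table costs time and memory, so on millions of rows it may not beat calling unique(gr) directly. Timing both on the real data is the only reliable check.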

score 0 · Accepted Answer

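You can do this by indexing:

key <- paste(as.character(seqnames(gr)), start(gr), end(gr), as.character(strand(gr)), sep = ":")
gr[!duplicated(key)]

paste(...) binds the fields of each row into a single identifying string that represents one distinct combination of the variables. duplicated(key) then creates a TRUE/FALSE vector that marks rows whose string has already appeared, and [] (indexing) with the negated vector drops those rows. Note that seqnames, ranges and strand are read with their accessor functions, because $ on a GRanges object only reaches metadata columns.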
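
One way to sanity-check the string-key idea is to compare its result with unique(). The sketch below builds the keys with as.character(), which GenomicRanges provides for GRanges objects and which formats each range like chr1:3-5:-; the names key and by_key are illustrative.

```r
library(GenomicRanges)

gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-", "-", "+"),
              ranges = IRanges(start = c(1, 3, 3, 5), width = 3))

# One string per range, e.g. "chr1:3-5:-"
key <- as.character(gr)

# Keep only the first range for each distinct key
by_key <- gr[!duplicated(key)]

# Compare against the built-in method
identical(as.character(by_key), as.character(unique(gr)))
#> [1] TRUE
```

Building a string for every range is typically slower and more memory hungry than unique(gr), so on a 9-million-row object this is more useful as a check than as the main approach.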
