0

我有一个GRanges对象,每个对象都有不同的基因组范围seqnames(例如染色体)。
我怎样才能得到一个GRanges只包含每个 seqname/chromosome 的最长范围?

例如,如果grGRanges

library(GenomicRanges)

# Make a GRanges object
set.seed(123)
gr <- GRanges(seqnames = rep(c("chr1", "chr2", "chr3"), times=2:4),
              ranges = IRanges(start=sample.int(10000, 9), 
                               width = c(3,5,50,20,10,500,100,500,200)))

# Add a column with the width for clarity:
mcols(gr)$width <- width(gr)

gr
#GRanges object with 9 ranges and 1 metadata column:
#      seqnames    ranges strand |     width
#         <Rle> <IRanges>  <Rle> | <integer>
#  [1]     chr1 2463-2465      * |         3
#  [2]     chr1 2511-2515      * |         5
#  [3]     chr2 8718-8767      * |        50
#  [4]     chr2 2986-3005      * |        20
#  [5]     chr2 1842-1851      * |        10
#  [6]     chr3 9334-9833      * |       500
#  [7]     chr3 3371-3470      * |       100
#  [8]     chr3 4761-5260      * |       500
#  [9]     chr3 6746-6945      * |       200
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths

然后我想获得以下内容GRanges

#GRanges object with 3 ranges and 1 metadata column:
#    seqnames      ranges strand |     width
#       <Rle>   <IRanges>  <Rle> | <integer>
#  [1]     chr1 2511-2515      * |         5
#  [2]     chr2 8718-8767      * |        50
#  [3]     chr3 9334-9833      * |       500

对于我的应用程序,我可以只获得第一个最长的范围,chr3但我希望有一个解决方案也可以选择所有关系(如果有的话)。

4

1 回答 1

0

我找到了这个解决方案:

library(GenomicRanges)

# Make the GRanges object
set.seed(123)
gr <- GRanges(seqnames = rep(c("chr1", "chr2", "chr3"), times=2:4),
              ranges = IRanges(start=sample.int(10000, 9), 
                               width = c(3,5,50,20,10,500,100,500,200)))

# Add a column with the width for clarity:
mcols(gr)$width <- width(gr)

# split gr by seqnames
grl <- split(gr, seqnames(gr))

# Get the width of each range organized as an IntegerList
wgr <- width(grl)
# if gr is ordered by seqnames this is equivalent to:
wgr <- splitAsList(width(gr), seqnames(gr))
wgr
#IntegerList of length 3
#[["chr1"]] 3 5
#[["chr2"]] 50 20 10
#[["chr3"]] 500 100 500 200

# Get the ranges with the largest width
# See ?'which.max,NumericList-method' for the global argument
longestRanges <- gr[which.max(wgr, global = TRUE)]

longestRanges
#GRanges object with 3 ranges and 1 metadata column:
#      seqnames      ranges strand |     width
#         <Rle>   <IRanges>  <Rle> | <integer>
#  [1]     chr1 2511-2515      * |         5
#  [2]     chr2 8718-8767      * |        50
#  [3]     chr3 9334-9833      * |       500
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths

请注意,此解决方案仅在 中的范围gr最初按 分组/排序时才有效seqnames
如果不是,则有必要从排序gr开始gr <- sort(gr)

要获得所有联系,可以使用:

longestRanges <- unlist(grl[which(wgr == max(wgr))])

longestRanges
#GRanges object with 4 ranges and 1 metadata column:
#       seqnames    ranges strand |     width
#          <Rle> <IRanges>  <Rle> | <integer>
#  chr1     chr1 2511-2515      * |         5
#  chr2     chr2 8718-8767      * |        50
#  chr3     chr3 9334-9833      * |       500
#  chr3     chr3 4761-5260      * |       500
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
于 2020-02-20T01:11:24.897 回答