3

When using the as.data.table() function from the data.table package to create a data.table out of a GRanges object, the default method is called (https://github.com/Rdatatable/data.table/blob/master/R/as.data.table.R#L8-L10).

This leads to the problem that the new data.table modifies the original data inside the GRanges by reference, leading to unexpected behavior.

I have a minimal example showing this:

library(GenomicRanges)
library(data.table)

## create small Granges object
gr <- GRanges(seqnames = "chr10",
          ranges = IRanges(start = c(1:10),end = c(11:20)),
          strand = "*")

## coerce to data.table (should return a copy of original data, shouldn't it?)
dt <- as.data.table(gr)

## modify using functional form assignment by reference 
dt[1:3, `:=`(seqnames=unique(seqnames),
                         start=min(start),
                         end=max(end),
                         #width=2,
                         strand = "*")]

## the data.table is changed,
dt
#    seqnames start end width strand
#1:    chr10     1  13    11      *
#2:    chr10     1  13    11      *
#3:    chr10     1  13    11      *
#4:    chr10     4  14    11      *
#5:    chr10     5  15    11      *
#6:    chr10     6  16    11      *
#7:    chr10     7  17    11      *
#8:    chr10     8  18    11      *
#9:    chr10     9  19    11      *
#10:    chr10    10  20    11      *

## but Granges is modified as well (somehow end(gr[1:3]) equals dt[1:3]$width).
gr
#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 1, 11]      *
#   [3]    chr10  [ 1, 11]      *
#   [4]    chr10  [ 4, 14]      *
#   [5]    chr10  [ 5, 15]      *
#   [6]    chr10  [ 6, 16]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths


## this gets even clearer when modifying dt$width
dt[4:6, `:=`(#seqnames=unique(seqnames),
         #start=min(start),
         #end=max(end),
         width=2#,
         #strand = "*"
         )]

## changing the width in data.table adapts the end of the gr
gr
#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 1, 11]      *
#   [3]    chr10  [ 1, 11]      *
#   [4]    chr10  [ 4,  5]      *
#   [5]    chr10  [ 5,  6]      *
#   [6]    chr10  [ 6,  7]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Right now my solution is to create a copy before continuing to modify the data.table:

gr2 <- GRanges(seqnames = "chr10",
          ranges = IRanges(start = c(1:10),end = c(11:20)),
          strand = "*")

dt2 <- copy(as.data.table(gr))

dt2[4:6, `:=`(width=2)]

dt2
#    seqnames start end width strand
# 1:    chr10     1  11    11      *
# 2:    chr10     1  11    11      *
# 3:    chr10     1  11    11      *
# 4:    chr10     4   5     2      *
# 5:    chr10     5   6     2      *
# 6:    chr10     6   7     2      *
# 7:    chr10     7  17    11      *
# 8:    chr10     8  18    11      *
# 9:    chr10     9  19    11      *
#10:    chr10    10  20    11      *

gr2
#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 2, 12]      *
#   [3]    chr10  [ 3, 13]      *
#   [4]    chr10  [ 4, 14]      *
#   [5]    chr10  [ 5, 15]      *
#   [6]    chr10  [ 6, 16]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

But since not everybody is aware of this behavior I wanted to post this here, maybe there is an easy way to prevent this behavior.

sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4    genomation_1.8.0     methylKit_1.2.0      GenomicRanges_1.28.4 GenomeInfoDb_1.12.2  IRanges_2.10.2       S4Vectors_0.14.3    
[8] BiocGenerics_0.22.0 

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.6.3 qvalue_2.8.0               gtools_3.5.0               reshape2_1.4.2             splines_3.4.0             
 [6] lattice_0.20-35            seqPattern_1.8.0           colorspace_1.3-2           rtracklayer_1.36.4         XML_3.98-1.9              
[11] rlang_0.1.1                R.oo_1.21.0                R.utils_2.5.0              BiocParallel_1.10.1        fastseg_1.22.0            
[16] matrixStats_0.52.2         GenomeInfoDbData_0.99.0    plyr_1.8.4                 stringr_1.2.0              zlibbioc_1.22.0           
[21] Biostrings_2.44.2          munsell_0.4.3              gtable_0.2.0               R.methodsS3_1.7.1          coda_0.19-1               
[26] Biobase_2.36.2             Rcpp_0.12.11               KernSmooth_2.23-15         readr_1.1.1                scales_0.4.1              
[31] BSgenome_1.44.0            limma_3.32.3               plotrix_3.6-5              DelayedArray_0.2.7         XVector_0.16.0            
[36] Rsamtools_1.28.0           impute_1.50.1              hms_0.3                    ggplot2_2.2.1              stringi_1.1.5             
[41] numDeriv_2016.8-1          tools_3.4.0                bitops_1.0-6               bbmle_1.0.19               magrittr_1.5              
[46] lazyeval_0.2.0             RCurl_1.95-4.8             tibble_1.3.3               MASS_7.3-47                Matrix_1.2-9              
[51] gridBase_0.4-7             emdbook_1.3.9              R6_2.2.2                   mclust_5.3                 GenomicAlignments_1.12.1  
[56] compiler_3.4.0   
4

1 回答 1

0

这是一个可行的解决方案:

dt = data.table(as.data.frame(gr))
于 2017-07-26T15:15:35.280 回答