When using the as.data.table() function from the data.table package to create a data.table out of a GRanges object, the default method is called (https://github.com/Rdatatable/data.table/blob/master/R/as.data.table.R#L8-L10).
This leads to the problem that the new data.table modifies the original data inside the GRanges by reference, leading to unexpected behavior.
I have a minimal example showing this:
library(GenomicRanges)
library(data.table)
## create small Granges object
gr <- GRanges(seqnames = "chr10",
ranges = IRanges(start = c(1:10),end = c(11:20)),
strand = "*")
## coerce to data.table (should return a copy of original data, shouldn't it?)
dt <- as.data.table(gr)
## modify using functional form assignment by reference
dt[1:3, `:=`(seqnames=unique(seqnames),
start=min(start),
end=max(end),
#width=2,
strand = "*")]
## the data.table is changed,
dt
# seqnames start end width strand
#1: chr10 1 13 11 *
#2: chr10 1 13 11 *
#3: chr10 1 13 11 *
#4: chr10 4 14 11 *
#5: chr10 5 15 11 *
#6: chr10 6 16 11 *
#7: chr10 7 17 11 *
#8: chr10 8 18 11 *
#9: chr10 9 19 11 *
#10: chr10 10 20 11 *
## but Granges is modified as well (somehow end(gr[1:3]) equals dt[1:3]$width).
gr
#GRanges object with 10 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr10 [ 1, 11] *
# [2] chr10 [ 1, 11] *
# [3] chr10 [ 1, 11] *
# [4] chr10 [ 4, 14] *
# [5] chr10 [ 5, 15] *
# [6] chr10 [ 6, 16] *
# [7] chr10 [ 7, 17] *
# [8] chr10 [ 8, 18] *
# [9] chr10 [ 9, 19] *
# [10] chr10 [10, 20] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
## this gets even clearer when modifying dt$width
dt[4:6, `:=`(#seqnames=unique(seqnames),
#start=min(start),
#end=max(end),
width=2#,
#strand = "*"
)]
## changing the width in data.table adapts the end of the gr
gr
#GRanges object with 10 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr10 [ 1, 11] *
# [2] chr10 [ 1, 11] *
# [3] chr10 [ 1, 11] *
# [4] chr10 [ 4, 5] *
# [5] chr10 [ 5, 6] *
# [6] chr10 [ 6, 7] *
# [7] chr10 [ 7, 17] *
# [8] chr10 [ 8, 18] *
# [9] chr10 [ 9, 19] *
# [10] chr10 [10, 20] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
Right now my solution is to create a copy before continuing to modify the data.table:
gr2 <- GRanges(seqnames = "chr10",
ranges = IRanges(start = c(1:10),end = c(11:20)),
strand = "*")
dt2 <- copy(as.data.table(gr))
dt2[4:6, `:=`(width=2)]
dt2
# seqnames start end width strand
# 1: chr10 1 11 11 *
# 2: chr10 1 11 11 *
# 3: chr10 1 11 11 *
# 4: chr10 4 5 2 *
# 5: chr10 5 6 2 *
# 6: chr10 6 7 2 *
# 7: chr10 7 17 11 *
# 8: chr10 8 18 11 *
# 9: chr10 9 19 11 *
#10: chr10 10 20 11 *
gr2
#GRanges object with 10 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr10 [ 1, 11] *
# [2] chr10 [ 2, 12] *
# [3] chr10 [ 3, 13] *
# [4] chr10 [ 4, 14] *
# [5] chr10 [ 5, 15] *
# [6] chr10 [ 6, 16] *
# [7] chr10 [ 7, 17] *
# [8] chr10 [ 8, 18] *
# [9] chr10 [ 9, 19] *
# [10] chr10 [10, 20] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
But since not everybody is aware of this behavior I wanted to post this here, maybe there is an easy way to prevent this behavior.
sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4 genomation_1.8.0 methylKit_1.2.0 GenomicRanges_1.28.4 GenomeInfoDb_1.12.2 IRanges_2.10.2 S4Vectors_0.14.3
[8] BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.6.3 qvalue_2.8.0 gtools_3.5.0 reshape2_1.4.2 splines_3.4.0
[6] lattice_0.20-35 seqPattern_1.8.0 colorspace_1.3-2 rtracklayer_1.36.4 XML_3.98-1.9
[11] rlang_0.1.1 R.oo_1.21.0 R.utils_2.5.0 BiocParallel_1.10.1 fastseg_1.22.0
[16] matrixStats_0.52.2 GenomeInfoDbData_0.99.0 plyr_1.8.4 stringr_1.2.0 zlibbioc_1.22.0
[21] Biostrings_2.44.2 munsell_0.4.3 gtable_0.2.0 R.methodsS3_1.7.1 coda_0.19-1
[26] Biobase_2.36.2 Rcpp_0.12.11 KernSmooth_2.23-15 readr_1.1.1 scales_0.4.1
[31] BSgenome_1.44.0 limma_3.32.3 plotrix_3.6-5 DelayedArray_0.2.7 XVector_0.16.0
[36] Rsamtools_1.28.0 impute_1.50.1 hms_0.3 ggplot2_2.2.1 stringi_1.1.5
[41] numDeriv_2016.8-1 tools_3.4.0 bitops_1.0-6 bbmle_1.0.19 magrittr_1.5
[46] lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.3.3 MASS_7.3-47 Matrix_1.2-9
[51] gridBase_0.4-7 emdbook_1.3.9 R6_2.2.2 mclust_5.3 GenomicAlignments_1.12.1
[56] compiler_3.4.0