我有 GRanges 对象(所有基因外显子的坐标);coding_pos
定义特定外显子中密码子的起始位置(1 表示外显子中的第一个核苷酸也是密码子中的第一个 nt,依此类推)。
grTargetGene本身看起来像这样
> grTargetGene
GRanges object with 11 ranges and 7 metadata columns:
seqnames ranges strand | ensembl_ids gene_biotype prev_exons_length coding_pos
<Rle> <IRanges> <Rle> | <character> <character> <numeric> <numeric>
[1] chr2 [148602722, 148602776] + | ENSG00000121989 protein_coding 0 1
[2] chr2 [148653870, 148654077] + | ENSG00000121989 protein_coding 55 2
[3] chr2 [148657027, 148657136] + | ENSG00000121989 protein_coding 263 3
[4] chr2 [148657313, 148657467] + | ENSG00000121989 protein_coding 373 2
[5] chr2 [148672760, 148672903] + | ENSG00000121989 protein_coding 528 1
[6] chr2 [148674852, 148674995] + | ENSG00000121989 protein_coding 672 1
[7] chr2 [148676016, 148676161] + | ENSG00000121989 protein_coding 816 1
[8] chr2 [148677799, 148677913] + | ENSG00000121989 protein_coding 962 3
[9] chr2 [148680542, 148680680] + | ENSG00000121989 protein_coding 1077 1
[10] chr2 [148683600, 148683730] + | ENSG00000121989 protein_coding 1216 2
[11] chr2 [148684649, 148684843] + | ENSG00000121989 protein_coding 1347 1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
我有兴趣分别查看每个密码子和 [3] 中 [1,2] 位置的坐标。换句话说,我想要 2 个不同的 GRanges 对象,它们看起来大致像这样(这里只是开始)
> grTargetGene_Nonsynonym
GRanges object with X ranges and 7 metadata columns:
seqnames ranges strand | ensembl_ids gene_biotype
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr2 [148602722, 148602723] + | ENSG00000121989 protein_coding
[2] chr2 [148602725, 148602726] + | ENSG00000121989 protein_coding
[3] chr2 [148602728, 148602729] + | ENSG00000121989 protein_coding
[4] chr2 [148602731, 148602732] + | ENSG00000121989 protein_coding
> grTargetGene_Synonym
GRanges object with X ranges and 7 metadata columns:
seqnames ranges strand | ensembl_ids gene_biotype
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr2 [148602724, 148602724] + | ENSG00000121989 protein_coding
[2] chr2 [148602727, 148602727] + | ENSG00000121989 protein_coding
[3] chr2 [148602730, 148602730] + | ENSG00000121989 protein_coding
[4] chr2 [148602733, 148602733] + | ENSG00000121989 protein_coding
我打算通过根据 and 为每个外显子创建一组 grange 的循环来完成它coding_pos
,strand
但我怀疑有一种更聪明的方法,甚至可能有一个函数可以做到这一点,但我找不到一个简单的解决方案.
重要提示:我不需要序列本身(在这种情况下,最简单的方法是先提取 DNA,然后使用序列),但我不需要这样做,我只需要我将用于与某些重叠的位置特征。
> library("GenomicRanges")
> dput(grTargetGene)
new("GRanges"
, seqnames = new("Rle"
, values = structure(1L, .Label = "chr2", class = "factor")
, lengths = 6L
, elementMetadata = NULL
, metadata = list()
)
, ranges = new("IRanges"
, start = c(148602722L, 148653870L, 148657027L, 148657313L, 148672760L,
148674852L)
, width = c(55L, 208L, 110L, 155L, 144L, 144L)
, NAMES = NULL
, elementType = "integer"
, elementMetadata = NULL
, metadata = list()
)
, strand = new("Rle"
, values = structure(1L, .Label = c("+", "-", "*"), class = "factor")
, lengths = 6L
, elementMetadata = NULL
, metadata = list()
)
, elementMetadata = new("DataFrame"
, rownames = NULL
, nrows = 6L
, listData = structure(list(ensembl_ids =
c("ENSG00000121989","ENSG00000121989",
"ENSG00000121989", "ENSG00000121989", "ENSG00000121989", "ENSG00000121989"
), gene_biotype = c("protein_coding", "protein_coding", "protein_coding",
"protein_coding", "protein_coding", "protein_coding"), cds_length =
c(1542,1542, 1542, 1542, 1542, 1542), gene_start_position = c(148602086L,
148602086L, 148602086L, 148602086L, 148602086L, 148602086L),
gene_end_position = c(148688393L, 148688393L, 148688393L,
148688393L, 148688393L, 148688393L), prev_exons_length = c(0,
55, 263, 373, 528, 672), coding_pos = c(1, 2, 3, 2, 1, 1)), .Names =
c("ensembl_ids", "gene_biotype", "cds_length", "gene_start_position",
"gene_end_position",
"prev_exons_length", "coding_pos"))
, elementType = "ANY"
, elementMetadata = NULL
, metadata = list()
)
, seqinfo = new("Seqinfo"
, seqnames = "chr2"
, seqlengths = NA_integer_
, is_circular = NA
, genome = NA_character_
)
, metadata = list()
)