ensembldb
将创建一个非常容易查询的本地集成数据库。我也很喜欢plyranges
,它允许我们使用类 SQL(和类 dplyr)语言对GenomicRanges
对象进行连接,这是基因组坐标的默认 Bioconductor 类。
首先,我们将调用/安装必要的软件包:
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(AnnotationHub)) BiocManager::install("AnnotationHub")
if(!require(ensembldb)) BiocManager::install("ensembldb")
if(!require(plyranges)) BiocManager::install("plyranges")
在这里,我查询AnnotationHub
找到我们要下载的特定数据库,然后我们下载它。
ah <- AnnotationHub()
#> snapshotDate(): 2020-04-27
query <- query(ah, "EnsDb", "Sus scrofa")
tibble(id = query$ah_id, title = query$title) %>%
dplyr::filter(str_detect(title, "Sus scrofa"))
#> # A tibble: 7 x 2
#> id title
#> <chr> <chr>
#> 1 AH64991 Ensembl 94 EnsDb for Sus scrofa
#> 2 AH68020 Ensembl 95 EnsDb for Sus scrofa
#> 3 AH69274 Ensembl 96 EnsDb for Sus scrofa
#> 4 AH73969 Ensembl 97 EnsDb for Sus scrofa
#> 5 AH75101 Ensembl 98 EnsDb for Sus scrofa
#> 6 AH78892 Ensembl 99 EnsDb for Sus scrofa
#> 7 AH79797 Ensembl 100 EnsDb for Sus scrofa
edb <- ah[["AH79797"]]
#> loading from cache
这里我做了一个小例子 Granges 对象:
#-- Example data
gen_regions <- data.frame(chr = c(1,1,1,2),
start = c(10000, 20000, 1000000,0),
end = c(11000, 21000, 1001000,100000)) %>%
GRanges()
然后,我们genes
从我们的 ensembl 数据库中获取。您可以在此处替换transcripts
不那么严格的。
genes <- genes(edb)
最后,一步加入我们的基因组区域和带注释的基因信息:
gene_overlap <- join_overlap_left(gen_regions, genes)
gene_overlap
#> GRanges object with 12 ranges and 8 metadata columns:
#> seqnames ranges strand | gene_id gene_name
#> <Rle> <IRanges> <Rle> | <character> <character>
#> [1] 1 10000-11000 * | ENSSSCG00000037372
#> [2] 1 20000-21000 * | ENSSSCG00000037372
#> [3] 1 1000000-1001000 * | <NA> <NA>
#> [4] 2 0-100000 * | ENSSSCG00000014554
#> [5] 2 0-100000 * | ENSSSCG00000014555 ODF3
#> ... ... ... ... . ... ...
#> [8] 2 0-100000 * | ENSSSCG00000014565
#> [9] 2 0-100000 * | ENSSSCG00000014558 SIRT3
#> [10] 2 0-100000 * | ENSSSCG00000014559 PSMD13
#> [11] 2 0-100000 * | ENSSSCG00000014561 NLRP6
#> [12] 2 0-100000 * | ENSSSCG00000025023 PGGHG
#> gene_biotype seq_coord_system
#> <character> <character>
#> [1] protein_coding chromosome
#> [2] protein_coding chromosome
#> [3] <NA> <NA>
#> [4] protein_coding chromosome
#> [5] protein_coding chromosome
#> ... ... ...
#> [8] protein_coding chromosome
#> [9] protein_coding chromosome
#> [10] protein_coding chromosome
#> [11] protein_coding chromosome
#> [12] protein_coding chromosome
#> description
#> <character>
#> [1] NULL
#> [2] NULL
#> [3] <NA>
#> [4] secretoglobin family 1C member 1 [Source:NCBI gene;Acc:100517148]
#> [5] outer dense fiber of sperm tails 3 [Source:NCBI gene;Acc:100499231]
#> ... ...
#> [8] interferon-induced transmembrane protein 1 [Source:NCBI gene;Acc:100127358]
#> [9] sirtuin 3 [Source:NCBI gene;Acc:100125971]
#> [10] proteasome 26S subunit, non-ATPase 13 [Source:NCBI gene;Acc:100517815]
#> [11] NLR family pyrin domain containing 6 [Source:NCBI gene;Acc:100519245]
#> [12] protein-glucosylgalactosylhydroxylysine glucosidase [Source:NCBI gene;Acc:100518005]
#> gene_id_version symbol entrezid
#> <character> <character> <list>
#> [1] ENSSSCG00000037372.2 NA
#> [2] ENSSSCG00000037372.2 NA
#> [3] <NA> <NA>
#> [4] ENSSSCG00000014554.4 100517148
#> [5] ENSSSCG00000014555.4 ODF3 100499231
#> ... ... ... ...
#> [8] ENSSSCG00000014565.3 100519082,100127358
#> [9] ENSSSCG00000014558.4 SIRT3 100125971
#> [10] ENSSSCG00000014559.4 PSMD13 100517815
#> [11] ENSSSCG00000014561.5 NLRP6 100519245
#> [12] ENSSSCG00000025023.3 PGGHG 100518005
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
该表已经是“整洁”的格式,但您可能希望将其汇总为每个区域都有一行。以下是如何做到这一点的建议:
per_region_genes <- gene_overlap %>%
as_tibble() %>%
mutate(gene_name = ifelse(gene_name == "", NA, gene_name)) %>%
group_by(seqnames, start, end) %>%
summarize(gene_ids = paste(na.exclude(gene_id), collapse = ", "),
gene_names = paste(na.exclude(gene_name), collapse = ", "), .groups = "drop")
per_region_genes
#> # A tibble: 4 x 5
#> seqnames start end gene_ids gene_names
#> <fct> <int> <int> <chr> <chr>
#> 1 1 10000 1.10e4 "ENSSSCG00000037372" ""
#> 2 1 20000 2.10e4 "ENSSSCG00000037372" ""
#> 3 1 1000000 1.00e6 "" ""
#> 4 2 0 1.00e5 "ENSSSCG00000014554, ENSSSCG000… "ODF3, RIC8A, SIRT3,…