r - ggplot2 histogram: how to add text annotations to histogram bars with ggplot2

Question

I am working with a data frame that has the following column names:

> [1] "Filename" "Strain" "DNA_Source" "Locus_Tag" "Product" "Transl_Tbl" "Note" "Seq_AA" "Protein_ID"

With the following code, I get a plot showing how many genes were found in each bacterial strain:

png(filename=paste('images/Pangenome_Histogram.png', sep=''), width=3750,height=2750,res=300)
par(mar=c(9.5,4.3,4,2))
print(h <- ggplot(myDF, aes(x=Strain, stat='bin', fill=factor(Filename), label=myDF$Filename)) + geom_bar() +
      labs(title='Gene Count by Strain Pangenome', x='Campylobacter Strains', y='Gene Count\n') +
      guides(title.theme = element_text(size=15, angle = 90)) + theme(legend.text=element_text(size=15), text = element_text(size=18)) +
      theme(axis.text.x=element_text(angle=45, size=16, hjust=1), axis.text.y=element_text(size=16), legend.position='none', plot.title = element_text(size=22)) )

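It may be a bit hard to see, but some strains have multicolored bars, which indicates that some of their genes come from sources other than the bacterial chromosome (or from several chromosomes, if the bacterium has more than one). I would like to label the bars by gene source ("DNA_Source") at the appropriate positions.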

png(filename=paste('images/Pangenome_Histogram.png', sep=''), width=3750,height=2750,res=300)
par(mar=c(9.5,4.3,4,2))
print(h <- ggplot(myDF, aes(x=Strain, stat='bin', fill=factor(Filename), label=myDF$Filename)) + geom_bar() +
      labs(title='Gene Count by Strain Pangenome', x='Campylobacter Strains', y='Gene Count\n') +
      guides(title.theme = element_text(size=15, angle = 90)) + theme(legend.text=element_text(size=15), text = element_text(size=18)) +
  geom_text(aes(label=DNA_Source, y='identity'), color='black', vjust=-5, size=4) +
      theme(axis.text.x=element_text(angle=45, size=16, hjust=1), axis.text.y=element_text(size=16), legend.position='none', plot.title = element_text(size=22)) )

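This gets me close, but it removes the counts from the y-axis (and adds the word "identity" in the bottom-left corner), and it draws the labels on top of one another so that they are unreadable unless they happen to be the same word.

I would like the y-axis labeled as in the first image and the labels as in the second image, but with each label placed inside the correspondingly colored section of the bar (visually similar to this: Showing data values on stacked bar chart in ggplot2), and I want to do it with the ggplot2 package.

I hope this is clear. Any help is appreciated, so thanks in advance.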

Here is some data (tail(dput(myDF[c(2, 3, 5)])))...

          Strain DNA_Source                             Product
12299 Campy3194c    Plasmid Type VI secretion protein, VC_A0111
12300 Campy3194c    Plasmid           Type VI secretion protein
12301 Campy3194c    Plasmid                              Tgh104
12302 Campy3194c    Plasmid                        protein ImpC
12303 Campy3194c    Plasmid           Type VI secretion protein
12304 Campy3194c Chromosome                              Tgh079

score 2 · Accepted Answer

Suppose you have a dataset that looks like this:

library(data.table)
library(ggplot2)
set.seed(123)
dna_src <- c("Chromosome", "Plasmid")
myDF <- data.table(Strain = c(rep("Campy3149c", 100),
                              rep("Campy31147q", 100)),
                   DNA_Source = c(sample(dna_src, size = 100, replace = TRUE,
                                    prob = c(0.9, 0.1)),
                                  sample(dna_src, size = 100, replace = TRUE,
                                    prob = c(0.7, 0.3))))
head(myDF)
#       Strain DNA_Source
#1: Campy3149c Chromosome
#2: Campy3149c Chromosome
#3: Campy3149c Chromosome
#4: Campy3149c Chromosome
#5: Campy3149c    Plasmid
#6: Campy3149c Chromosome

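You can use data.table to collapse the data into a shorter data.table that holds most of the information we need. The only addition is the y-value for each label, which we compute as follows: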

dt <- myDF[, .(countStrain = .N), by = c("Strain", "DNA_Source")][order(Strain, DNA_Source)]

# add the y-values for the plot
dt[, yval := cumsum(countStrain) - 0.5 * countStrain, by = Strain]
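
To see why `cumsum(countStrain) - 0.5 * countStrain` gives the vertical center of each segment, here is the arithmetic on made-up counts for a single strain (the values 90 and 10 are illustrative assumptions, not taken from the data above). The cumulative sum is the top edge of each stacked segment, and subtracting half the segment's height moves back to its middle. This assumes the segments are stacked bottom-up in the same order as the rows.

```r
# Hypothetical counts for one strain, listed in stacking order (bottom first).
counts <- c(Chromosome = 90, Plasmid = 10)

# Top edge of each stacked segment.
top_edge <- cumsum(counts)

# Centre of each segment: top edge minus half of the segment's own height.
midpoint <- top_edge - 0.5 * counts

print(midpoint)
```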

Finally, we plot the values:

ggplot(dt, aes(x = Strain, y = countStrain, fill = DNA_Source)) + 
  geom_bar(stat = "identity") + 
  geom_text(data = dt, aes(x = Strain, y = yval, label = DNA_Source))

This results in a plot like this:
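
One caveat, offered as a hedged note rather than part of the original answer: since ggplot2 2.2.0, bars are stacked in the reverse order of the grouping, so a manually computed `yval` may end up in the wrong segment on newer versions. A minimal sketch of an alternative, assuming ggplot2 2.2.0 or later, is to let `position_stack(vjust = 0.5)` center the labels, so that the text layer follows whatever stacking the bars use. The data frame `dt2` and its counts are made up for illustration and stand in for the summarized table.

```r
library(ggplot2)

# Made-up summary table standing in for `dt` (names and counts are illustrative).
dt2 <- data.frame(
  Strain      = c("A", "A", "B", "B"),
  DNA_Source  = c("Chromosome", "Plasmid", "Chromosome", "Plasmid"),
  countStrain = c(90, 10, 70, 30)
)

p <- ggplot(dt2, aes(x = Strain, y = countStrain, fill = DNA_Source)) +
  geom_col() +  # equivalent to geom_bar(stat = "identity")
  # Stack the labels like the bars and place each one halfway up its segment.
  geom_text(aes(label = DNA_Source), position = position_stack(vjust = 0.5))

print(p)
```

With this approach no separate `yval` column is needed, and the labels stay aligned if the stacking order changes.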
