r - Stacked histogram from already summarized counts using ggplot2

Question

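I would like some help coloring a ggplot2 histogram generated from already summarized count data.

The data are something like counts of the number of males and the number of females living in a number of different regions. Plotting a histogram of the total counts (i.e. males + females) is easy: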

library(ggplot2)

set.seed(1)
N <- 100
X <- data.frame(C1 = rnbinom(N, 15, 0.1), C2 = rnbinom(N, 15, 0.1))
X$C <- X$C1 + X$C2
ggplot(X, aes(x = C)) + geom_histogram()

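However, I would like to color each bar according to the relative contributions of C1 and C2, so that I get the same histogram as in the example above (i.e. the same overall bar heights), but can also see the proportions of "C1" and "C2" individuals within each bar, as in a stacked bar chart.

Any suggestions for a clean way to do this with ggplot2, using data like "X" in the example?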

score 14 · Accepted Answer

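Very quickly: you can do what the OP wants by computing the histogram manually with the plyr package and plotting it with the stat="identity" option, like so: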

library(plyr)

# bin midpoints (bin width 20), then per-bin row count and the C1 share of it
X$mid <- floor(X$C/20)*20+10
X_plot <- ddply(X, .(mid), summarize, total=length(C), split=sum(C1)/sum(C)*length(C))

ggplot(data=X_plot) +
  geom_histogram(aes(x=mid, y=total), fill="blue", stat="identity") +
  geom_histogram(aes(x=mid, y=split), fill="deeppink", stat="identity")

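We basically just create a "mid" column that says where to position each bar, and then draw two layers: one with the row count of each bin (the histogram of the total, C), and one with that count scaled down to the share contributed by one of the columns (C1). You should be able to customize it from here.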
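The same per-bin table can be built without plyr. The following is a minimal base-R sketch, assuming the bin width of 20 used in the answer; the name bin_width and the tapply-based aggregation are illustrative choices, not part of the original answer.

```r
# Rebuild the example data (same seed and column order as in the question).
set.seed(1)
N <- 100
X <- data.frame(C1 = rnbinom(N, 15, 0.1), C2 = rnbinom(N, 15, 0.1))
X$C <- X$C1 + X$C2

# Assumption: bins of width 20, labelled by their midpoint.
bin_width <- 20
X$mid <- floor(X$C / bin_width) * bin_width + bin_width / 2

# Per bin: number of rows, sum of C1, and sum of C.
total  <- tapply(X$C,  X$mid, length)
sum_c1 <- tapply(X$C1, X$mid, sum)
sum_c  <- tapply(X$C,  X$mid, sum)

# total is the bar height; split is the part of it attributed to C1.
X_plot <- data.frame(
  mid   = as.numeric(names(total)),
  total = as.vector(total),
  split = as.vector(sum_c1 / sum_c * total)
)
print(X_plot)
```

The resulting X_plot has the same three columns as the ddply version, so it should be usable with the ggplot call shown in the answer.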

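[Figure: histogram demo]

Update 1: I realized I made a small mistake in computing the midpoints. It is fixed now. Also, I don't know why I used a "ddply" statement to compute the mids. That was silly. The new code is clearer and more concise.

Update 2: I went back to look at the comments and noticed something horrifying: I was using sums as the histogram frequencies. I have cleaned up the code a bit and added some of the suggestions from the comments regarding the coloring syntax.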

score 7 · Accepted Answer

Here is an approach using ggplot_build. The idea is to first get your old/original plot:

p <- ggplot(data = X, aes(x=C)) + geom_histogram()

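stored in p. Then use ggplot_build(p)$data[[1]] to extract the data, specifically the columns xmin and xmax (to get the same histogram breaks/binwidths) and the count column (to scale the proportions by count). Here is the code: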

# get old plot
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
# get data of old plot: cols = count, xmin and xmax
d <- ggplot_build(p)$data[[1]][c("count", "xmin", "xmax")]
# add an id column for ddply
d$id <- seq(nrow(d))

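Now, how do we generate the data? What I understand from your post is this. Take the first bar in the plot, for example. It has a count of 2 and extends from xmin = 147 to xmax = 156.8. When we check X for these values: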

X[X$C >= 147 & X$C <= 156.8, ] # count = 2 as shown below
#    C1 C2   C
# 19 91 63 154
# 75 86 70 156

Here, I compute (91+86)/(154+156) * (count=2) = 1.141935 and (63+70)/(154+156) * (count=2) = 0.8580645 as the two normalized values for each bar we will generate.
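The arithmetic for this first bar can be checked directly. This is a small sketch that hard-codes the two rows printed above; the names bar, share_c1 and share_c2 are only illustrative.

```r
# The two rows of X that fall in the first bar, as printed above.
bar <- data.frame(C1 = c(91, 86), C2 = c(63, 70), C = c(154, 156))
count <- nrow(bar)

# Share of each column in the bin total, scaled to the bar's count.
share_c1 <- sum(bar$C1) / sum(bar$C) * count
share_c2 <- sum(bar$C2) / sum(bar$C) * count

print(c(C1 = share_c1, C2 = share_c2))

# The two pieces add back up to the bar's count, so the stacked
# bar keeps the height of the original histogram bar.
stopifnot(isTRUE(all.equal(share_c1 + share_c2, count)))
```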

require(plyr)
dd <- ddply(d, .(id), function(x) {
    # rows of X that fall inside this bin
    t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
    if(nrow(t) == 0) return(c(0,0))
    # shares of C1 and C2 in the bin total, scaled to the bin's count
    colSums(t)[1:2]/colSums(t)[3] * x$count
})

# then it's just normal plotting
require(reshape2)
dd <- melt(dd, id.var="id")
ggplot(data = dd, aes(x=id, y=value)) + 
      geom_bar(aes(fill=variable), stat="identity", group=1)

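Here is the original plot:

[Figure: original_ggplot2_plot]

And this is what I get:

[Figure: ggplot2_weird_histogram_plot]

Edit: If you also want to get the proper breaks, you can take the corresponding x coordinates from the old plot and use them here instead of id: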

p <- ggplot(data = X, aes(x=C)) + geom_histogram()
d <- ggplot_build(p)$data[[1]][c("count", "x", "xmin", "xmax")]
d$id <- seq(nrow(d))

require(plyr)
dd <- ddply(d, .(id), function(x) {
    t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
    if(nrow(t) == 0) return(c(x$x,0,0))
    p <- c(x=x$x, colSums(t)[1:2]/colSums(t)[3] * x$count)
})

require(reshape2)
dd.m <- melt(dd, id.var="V1", measure.var=c("V2", "V3"))
ggplot(data = dd.m, aes(x=V1, y=value)) + 
      geom_bar(aes(fill=variable), stat="identity", group=1)
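One possible caveat with the interval test X$C >= x$xmin & X$C <= x$xmax used above: a value that sits exactly on an edge shared by two bins matches both of them. Whether this ever happens depends on the breaks ggplot2 picks, so it may not affect the example at all. The sketch below uses made-up breaks and values (both hypothetical, chosen only to trigger the edge case) to compare the closed-interval test with cut(), which assigns each value to exactly one bin.

```r
# Hypothetical breaks and values, chosen only to illustrate the edge case.
breaks <- c(140, 150, 160, 170)
values <- c(145, 150, 150, 158, 170)

# Closed-interval test per bin, mirroring the condition in the ddply() function.
closed_counts <- sapply(seq_len(length(breaks) - 1), function(i) {
  sum(values >= breaks[i] & values <= breaks[i + 1])
})

# cut() uses intervals of the form (a, b]; include.lowest = TRUE also keeps
# values equal to the lowest break, so every value lands in exactly one bin.
bins <- cut(values, breaks = breaks, include.lowest = TRUE)
cut_counts <- as.vector(table(bins))

print(closed_counts)  # the two values at 150 are counted in two bins
print(cut_counts)     # each value is counted once
```

Note that cut() decides which side of a bin is closed through its right argument, and that choice may differ from the convention of the histogram being reproduced, so the counts should be compared against the count column before relying on them.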


score 2 · Accepted Answer

How about:

library("reshape2")
mm <- melt(X[,1:2])
ggplot(mm,aes(x=value,fill=variable))+geom_histogram(position="stack")
