r - Producing the same graph from raw data and summarized data

Question

Identical graph from raw data and summarized data

For the following data structure

dsN<-data.frame(
  id=rep(1:100, each=4),
  yearF=factor(rep(2001:2004, 100)),
  attendF=sample(1:8, 400, TRUE, c(.2,.2,.15,.10,.10, .20, .15, .02))
)
dsN[sample(which(dsN$yearF==2001), 5), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2002), 10), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2003), 15), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2004), 20), "attendF"]<-NA

attcol8<-c("Never"="#4575b4",
           "Once or Twice"="#74add1",
           "Less than once/month"="#abd9e9",
           "About once/month"="#e0f3f8",
           "About twice/month"="#fee090",
           "About once/week"="#fdae61",
           "Several times/week"="#f46d43",
           "Everyday"="#d73027")
dsN$attendF<-factor(dsN$attendF, levels=1:8, labels=names(attcol8))
head(dsN,13)

   id yearF              attendF
1   1  2001      About once/week
2   1  2002     About once/month
3   1  2003      About once/week
4   1  2004                 <NA>
5   2  2001 Less than once/month
6   2  2002      About once/week
7   2  2003      About once/week
8   2  2004   Several times/week
9   3  2001        Once or Twice
10  3  2002      About once/week
11  3  2003                 <NA>
12  3  2004        Once or Twice
13  4  2001   Several times/week

we can obtain a series of stacked bar charts

require(ggplot2)
# p<- ggplot( subset(dsN,!is.na(attendF)), aes(x=yearF, fill=attendF)) # leaving NA out of
p<- ggplot( dsN, aes(x=yearF, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="fill")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

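As a rough check on what position="fill" draws, the per-year proportions can be computed by hand. This is a minimal sketch in base R; the data frame toy and the objects counts and props are hypothetical stand-ins for illustration, not the dsN from the question.

```r
# Hypothetical toy data standing in for dsN (not the data from the question)
toy <- data.frame(
  yearF   = factor(c(2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002)),
  attendF = factor(c("Never", "Never", "Everyday", NA,
                     "Never", "Everyday", "Everyday", "Everyday"),
                   levels = c("Never", "Everyday"))
)

# Count every yearF x attendF combination, keeping NA as its own category
counts <- table(toy$yearF, toy$attendF, useNA = "ifany")

# Divide each row (one row per year) by its total; these proportions
# correspond to the segment heights of a filled stacked bar
props <- prop.table(counts, margin = 1)
print(props)
```

Each row of props sums to 1, which is why every bar in the filled chart reaches the top of the y axis.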

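The graph above is produced from the raw data. However, it is sometimes convenient to produce graphs from summarized data, especially when you need control over the statistical functions. Below is the transformation of dsN into ds, which contains only the values that are actually mapped onto the graph above: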

require(dplyr)
# %>% replaces the %.% operator, which is no longer available in current dplyr
ds<- dsN %>%
  dplyr::filter(!is.na(attendF)) %>%
  dplyr::group_by(yearF,attendF) %>%
  dplyr::summarize(count = sum(attendF)) %>%
  dplyr::mutate(total = sum(count),
                percent= count/total)
head(ds,10)

    Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    18   373 0.04826
    2   2001        Once or Twice    36   373 0.09651
    3   2001 Less than once/month    30   373 0.08043
    4   2001     About once/month    32   373 0.08579
    5   2001    About twice/month    40   373 0.10724
    6   2001      About once/week    90   373 0.24129
    7   2001   Several times/week   119   373 0.31903
    8   2001             Everyday     8   373 0.02145
    9   2002                Never    11   355 0.03099
    10  2002        Once or Twice    44   355 0.12394

# verify
summarize(filter(ds, yearF==2001), should.be.one=sum(percent))

    Source: local data frame [1 x 2]

      yearF should.be.one
    1  2001             1

Question:

How can I recreate the graph above using this summary dataset ds?

score 2 · Accepted Answer

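Well, part of the problem is that your summary is incorrect. You need to keep the NA values in the data if you want them properly accounted for in the totals. Perhaps try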

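# %>% replaces the %.% operator, which is no longer available in current dplyr
ds<- dsN %>%
  dplyr::group_by(yearF,attendF) %>%
  dplyr::summarize(count = length(attendF)) %>%  # rows per year/response, NA is its own group
  dplyr::mutate(total = sum(count, na.rm=TRUE),  # total rows within each year
                percent= count/total)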
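To make the difference between sum() and a row count concrete, here is a minimal sketch on a hypothetical toy data frame (toy, sum_result and ds_toy are illustrative names, not from the question). It assumes dplyr 1.0 or later for the .groups argument, and it uses n(), which gives the same group size as length(attendF) inside summarize().

```r
library(dplyr)

# Hypothetical toy data (not the dsN from the question)
toy <- data.frame(
  yearF   = factor(c(2001, 2001, 2001, 2002, 2002)),
  attendF = factor(c("Never", "Everyday", "Everyday", "Never", NA),
                   levels = c("Never", "Everyday"))
)

# sum() is not defined for factors: it raises an error instead of counting rows
sum_result <- tryCatch(sum(toy$attendF), error = function(e) "error")

# n() counts the rows in each yearF x attendF group; NA forms its own group
ds_toy <- toy %>%
  group_by(yearF, attendF) %>%
  summarize(count = n(), .groups = "drop_last") %>%  # still grouped by yearF
  mutate(total   = sum(count),                       # rows per year
         percent = count / total)

print(ds_toy)
```

Because the result stays grouped by yearF after summarize(), total is computed within each year, so percent sums to 1 per year.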

Then, with the summarized data, you only need to change your first two lines slightly

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")

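Note that we added an explicit y value and told geom_bar to use stat="identity", which means the actual y values we supply are used as the bar heights. Both versions produce the same image.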

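As a side note, and assuming ggplot2 2.2.0 or later, geom_col() is a shorthand for geom_bar(stat="identity"). The sketch below uses a hypothetical pre-summarized data frame toy_sum (not the ds from the question) and compares the computed bar positions of both versions with layer_data().

```r
library(ggplot2)

# Hypothetical pre-summarized data (not the ds from the question)
toy_sum <- data.frame(
  yearF   = factor(c("2001", "2001", "2002", "2002")),
  attendF = factor(c("Never", "Everyday", "Never", "Everyday"),
                   levels = c("Never", "Everyday")),
  percent = c(0.25, 0.75, 0.5, 0.5)
)

# Version from the answer: bar heights taken directly from the y aesthetic
p_identity <- ggplot(toy_sum, aes(x = yearF, y = percent, fill = attendF)) +
  geom_bar(position = "stack", stat = "identity")

# Equivalent shorthand
p_col <- ggplot(toy_sum, aes(x = yearF, y = percent, fill = attendF)) +
  geom_col(position = "stack")

# Extract the computed rectangles of the first (and only) layer of each plot
bars_identity <- layer_data(p_identity, 1)
bars_col      <- layer_data(p_col, 1)

print(bars_col[, c("x", "ymin", "ymax")])
```

Both layers yield the same ymin and ymax values, so either spelling can be used with the summarized data.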

score 0 · Accepted Answer

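Answer from comment 1

As @MrFlick pointed out, the error was in the formula used inside summarize(). However, whether to keep the missing values when computing the totals is a meaningful research decision. If we want to keep NA in the calculation of the total: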

# %>% replaces the %.% operator, which is no longer available in current dplyr
ds<- dsN %>%
#   dplyr::filter(!is.na(attendF)) %>%   # comment out to count NA in the total
  dplyr::group_by(yearF,attendF) %>%
  dplyr::summarize(count = length( attendF)) %>%
  dplyr::mutate(total = sum(count),
                percent= count/total)
head(ds,10)

    Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    23   100    0.23
    2   2001        Once or Twice     9   100    0.09
    3   2001 Less than once/month    16   100    0.16
    4   2001     About once/month    11   100    0.11
    5   2001    About twice/month     3   100    0.03
    6   2001      About once/week    21   100    0.21
    7   2001   Several times/week     9   100    0.09
    8   2001             Everyday     3   100    0.03
    9   2001                   NA     5   100    0.05
    10  2002                Never    17   100    0.17

Missing values are counted in the total number of responses, so the graph shows the natural attrition in the study.

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p


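However, assuming that attrition is not significantly associated with the outcome measure, it would be interesting to see how the relative prevalence of each response category changes over time, or perhaps stays stable. For this, we need to remove the missing values from the calculation of the total number of responses: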

# %>% replaces the %.% operator, which is no longer available in current dplyr
ds<- dsN %>%
  dplyr::filter(!is.na(attendF)) %>%   # comment out to count NA in the total
  dplyr::group_by(yearF,attendF) %>%
  dplyr::summarize(count = length( attendF)) %>%
  dplyr::mutate(total = sum(count),
                percent= count/total)
head(ds,10)

    Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    23    95 0.24211
    2   2001        Once or Twice     9    95 0.09474
    3   2001 Less than once/month    16    95 0.16842
    4   2001     About once/month    11    95 0.11579
    5   2001    About twice/month     3    95 0.03158
    6   2001      About once/week    21    95 0.22105
    7   2001   Several times/week     9    95 0.09474
    8   2001             Everyday     3    95 0.03158
    9   2002                Never    17    90 0.18889
    10  2002        Once or Twice    23    90 0.25556

The graph reflects this accordingly:

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  NA excluded from calculations
p<- p+ geom_bar(position="stack", stat="identity")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

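If both views are needed, the two denominators can also be carried in a single summary. This is only a sketch under stated assumptions: toy, both, and the column names total_all, total_valid, percent_all and percent_valid are hypothetical, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical toy data (not the dsN from the question): 2001 has one missing response
toy <- data.frame(
  yearF   = factor(c(2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002)),
  attendF = factor(c("Never", "Never", "Everyday", NA,
                     "Never", "Everyday", "Everyday", "Everyday"),
                   levels = c("Never", "Everyday"))
)

both <- toy %>%
  group_by(yearF) %>%
  mutate(total_all   = n(),                       # every row in the wave, NA included
         total_valid = sum(!is.na(attendF))) %>%  # non-missing responses only
  filter(!is.na(attendF)) %>%
  group_by(yearF, attendF, total_all, total_valid) %>%
  summarize(count = n(), .groups = "drop") %>%
  mutate(percent_all   = count / total_all,       # attrition shows as a gap below 1
         percent_valid = count / total_valid)     # sums to 1 within each wave

print(both)
```

Mapping y to percent_all or to percent_valid then switches between the two graphs without recomputing the summary.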

Thanks, @MrFlick!
