49

I have been trying to overlay a normal curve on my histogram with ggplot2.

My code:

data <- read.csv (path...)

ggplot(data, aes(V2)) + 
  geom_histogram(alpha=0.3, fill='white', colour='black', binwidth=.04)

I tried several things:

+ stat_function(fun=dnorm)  

...which changed nothing.

+ stat_density(geom = "line", colour = "red")

...which gave me a straight red line along the x-axis.

+ geom_density()  

does not work for me, because I want to keep frequency values on the y-axis and do not want density values.

Any suggestions?

Thanks in advance for any tips!

Solution found!

+geom_density(aes(y=0.045*..count..), colour="black", adjust=4)

4

5 Answers

38

I think I've got it:

set.seed(1)
df <- data.frame(PF = 10*rnorm(1000))
ggplot(df, aes(x = PF)) + 
    geom_histogram(aes(y =..density..),
                   breaks = seq(-50, 50, by = 10), 
                   colour = "black", 
                   fill = "white") +
    stat_function(fun = dnorm, args = list(mean = mean(df$PF), sd = sd(df$PF)))
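
On recent ggplot2 releases (3.4.0 and later) the `..density..` notation still works but triggers a deprecation warning. A minimal sketch of the same plot written with `after_stat(density)`, assuming ggplot2 3.3.0 or newer is installed; the object name `p` is only there for illustration:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(PF = 10 * rnorm(1000))

# Same histogram-plus-normal-curve plot, with after_stat() replacing ..density..
p <- ggplot(df, aes(x = PF)) +
  geom_histogram(aes(y = after_stat(density)),
                 breaks = seq(-50, 50, by = 10),
                 colour = "black",
                 fill = "white") +
  # Normal curve using the sample mean and standard deviation
  stat_function(fun = dnorm,
                args = list(mean = mean(df$PF), sd = sd(df$PF)))
p
```

Both layers are on the density scale here, so the bars and the curve each cover an area of 1.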

answered 2012-11-28T16:33:52.450
33

This has been answered here and partially here.

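The area under a density curve equals 1, while the area under a histogram equals the width of the bars times the sum of their heights, i.e. the binwidth times the total number of non-missing observations. To fit both on the same plot, one or the other has to be rescaled so that their areas match.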
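
The area argument can be checked numerically in base R. This is only a sketch: the simulated data, the bin width of 2, and the names `hist_area`, `scaled_curve` and `curve_area` are assumptions made for illustration.

```r
# Check that a count histogram has area n * binwidth, and that
# n * binwidth * dnorm() covers the same area.
set.seed(1)
x  <- rnorm(500, mean = 20, sd = 5)   # simulated data (assumption)
bw <- 2                               # bin width (assumption)
n  <- sum(!is.na(x))                  # number of non-missing observations

# Bin the data without plotting; the breaks cover the full range of x
breaks <- seq(floor(min(x) / bw) * bw, ceiling(max(x) / bw) * bw, by = bw)
h <- hist(x, breaks = breaks, plot = FALSE)

# Bar width times the sum of the bar heights
hist_area <- sum(h$counts * bw)

# Normal density rescaled to the count scale
scaled_curve <- function(v) dnorm(v, mean = mean(x), sd = sd(x)) * bw * n

# Integrate over mean +/- 8 sd, which holds essentially all of the mass
curve_area <- integrate(scaled_curve,
                        mean(x) - 8 * sd(x),
                        mean(x) + 8 * sd(x))$value

c(hist_area = hist_area, target = n * bw, curve_area = curve_area)
```

The three numbers should agree up to integration error, which is why multiplying the density by `bw * n` puts the curve on the same scale as the counts.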

If you want the y-axis to show frequency counts, there are several options:

First, simulate some data.

library(ggplot2)

set.seed(1)
dat_hist <- data.frame(
  group = c(rep("A", 200), rep("B",150)),
  value = c(rnorm(200, 20, 5), rnorm(150,25,10)))

# Set desired binwidth and number of non-missing obs
bw = 2
n_obs = sum(!is.na(dat_hist$value))

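Option 1: Plot both the histogram and the density curve as densities, then rescale the y-axis

For a single histogram this is perhaps the simplest approach. Using the approach suggested by Carlos, plot both the histogram and the density curve as densities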

g <- ggplot(dat_hist, aes(value)) + 
  geom_histogram(aes(y = ..density..), binwidth = bw, colour = "black") + 
  stat_function(fun = dnorm, args = list(mean = mean(dat_hist$value), sd = sd(dat_hist$value)))

Then rescale the y-axis.

ybreaks = seq(0,50,5) 
## On primary axis
g + scale_y_continuous("Counts", breaks = round(ybreaks / (bw * n_obs),3), labels = ybreaks)

## Or on secondary axis
g + scale_y_continuous("Density", sec.axis = sec_axis(
  trans = ~ . * bw * n_obs, name = "Counts", breaks = ybreaks))

[Figure: single histogram with a normal curve]
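
The axis rescaling in Option 1 is just a division or multiplication by `bw * n_obs`. A small worked sketch of that arithmetic, where the helper names `count_to_density` and `density_to_count` are hypothetical and `n_obs` is hard-coded to the 350 simulated observations:

```r
bw    <- 2     # bin width, as in the example above
n_obs <- 350   # 200 + 150 simulated observations, none missing

# Hypothetical helpers for moving between the two y-axis scales
count_to_density <- function(count) count / (bw * n_obs)
density_to_count <- function(dens) dens * bw * n_obs

# Positions on the density axis where the count labels 0, 5, ..., 50 belong
ybreaks <- seq(0, 50, 5)
density_breaks <- round(count_to_density(ybreaks), 3)
density_breaks

# A density of 0.05 corresponds to a count of 35 with these settings
density_to_count(0.05)
```

The first helper matches the `breaks` expression used on the primary axis, and the second matches the `sec_axis` transformation.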

Option 2: Rescale the density curve using stat_function

Code tidied up following PatrickT's answer.

ggplot(dat_hist, aes(value))  + 
  geom_histogram(colour = "black", binwidth = bw) + 
  stat_function(fun = function(x) 
    dnorm(x, mean = mean(dat_hist$value), sd = sd(dat_hist$value)) * bw * n_obs)

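Option 3: Create an external dataset and plot using geom_line.

Unlike the options above, this one works with facets. (Edited to provide a dplyr-based rather than plyr-based solution.) Note that the summarised dataset is used as the main dataset, and the raw data is passed in only for the histogram.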

library(tidyverse)

dat_hist %>% 
  group_by(group) %>% 
  nest(data = c(value)) %>% 
  mutate(y = map(data, ~ dnorm(
    .$value, mean = mean(.$value), sd = sd(.$value)
    ) * bw * sum(!is.na(.$value)))) %>% 
  unnest(c(data,y)) %>% 
  
  ggplot(aes(x = value)) +
  geom_histogram(data = dat_hist, binwidth = bw, colour = "black") +
  geom_line(aes(y = y)) + 
  facet_wrap(~ group)

[Figure: faceted histograms with normal curves]

Option 4: Create external functions to edit the data on the fly

Perhaps a bit over the top, but it might be useful to someone?

## Function to create scaled dnorm data along full x axis range
dnorm_scaled <- function(data, x = NULL, binwidth = 1, xlim = NULL) {
  .x <- na.omit(data[,x])
  if(is.null(xlim))
    xlim = c(min(.x), max(.x))
  x_range = seq(xlim[1], xlim[2], length.out = 101)
  setNames(
    data.frame(
    x = x_range,
    y = dnorm(x_range, mean = mean(.x), sd = sd(.x)) * length(.x) * binwidth),
    c(x, "y"))
}

## Function to apply over groups
dnorm_scaled_group <- function(data, x = NULL, group = NULL, binwidth = NULL, xlim = NULL) {
  dat_hists <- lapply(
    split(data, data[, group]), dnorm_scaled,
      x = x, binwidth = binwidth, xlim = xlim)
  for(g in names(dat_hists))
    dat_hists[[g]][, "group"] <- g
  setNames(do.call(rbind, dat_hists), c(x, "y", group))
}

## Single histogram
ggplot(dat_hist, aes(value)) + 
  geom_histogram(binwidth = bw, colour = "black") + 
  geom_line(data = ~ dnorm_scaled(., "value", binwidth = bw), 
            aes(y = y)) 

## With a single faceting variable
ggplot(dat_hist, aes(value))  + 
  geom_histogram(binwidth = 2, colour = "black") + 
  geom_line(data = ~ dnorm_scaled_group(
    ., x = "value", group = "group", binwidth = 2, xlim = c(0,50)), 
    aes(y = y)) +
  facet_wrap(~ group)

answered 2016-03-31T21:41:03.407
15

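This is an extended comment on JWilliman's answer. I found J's answer very useful. While playing around, I found a way to simplify the code. I'm not saying it is a better way, but I thought I'd mention it.

Note that JWilliman's answer puts counts on the y-axis and uses a "hack" to scale the corresponding normal density approximation (which would otherwise cover a total area of 1 and therefore have a much lower peak).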
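
To see how much lower the unscaled peak is, here is a rough calculation in base R, assuming a standard normal with `n = 1000` observations and a bin width of 0.3 (the names `peak_density` and `peak_counts` are just for illustration):

```r
n        <- 1000
binwidth <- 0.3

# Height of the standard normal density at its mode
peak_density <- dnorm(0, mean = 0, sd = 1)

# The same peak after rescaling to the count axis
peak_counts <- peak_density * n * binwidth

c(density_scale = peak_density, count_scale = peak_counts)
```

On a count axis where the tallest bars reach roughly 120, an unscaled curve peaking at about 0.4 is indistinguishable from a flat line along the x-axis.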

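The main point of this comment: a simpler syntax inside stat_function, obtained by passing the required parameters to the aesthetics function, e.g.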

aes(x = x, mean = 0, sd = 1, binwidth = 0.3, n = 1000)

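This avoids having to pass args = to stat_function and is therefore more user-friendly. Okay, it's not very different, but hopefully someone will find it interesting.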

# parameters that will be passed to ``stat_function``
n = 1000
mean = 0
sd = 1
binwidth = 0.3 # passed to geom_histogram and stat_function
set.seed(1)
df <- data.frame(x = rnorm(n, mean, sd))

ggplot(df, aes(x = x, mean = mean, sd = sd, binwidth = binwidth, n = n)) +
    theme_bw() +
    geom_histogram(binwidth = binwidth, 
        colour = "white", fill = "cornflowerblue", size = 0.1) +
    stat_function(fun = function(x) dnorm(x, mean = mean, sd = sd) * n * binwidth,
        color = "darkred", size = 1)

answered 2017-10-22T17:16:33.763
8

This code should do it:

set.seed(1)
z <- rnorm(1000)

qplot(z, geom = "blank") + 
  geom_histogram(aes(y = ..density..)) + 
  stat_density(geom = "line", aes(colour = "bla")) + 
  stat_function(fun = dnorm, aes(x = z, colour = "blabla")) + 
  scale_colour_manual(name = "", values = c("red", "green"), 
                      breaks = c("bla", "blabla"), 
                      labels = c("kernel_est", "norm_curv")) + 
  theme(legend.position = "bottom", legend.direction = "horizontal")

Note: I used qplot, but you can use the more general ggplot.

answered 2011-08-06T15:26:08.937
1

Here is a tidyverse-informed version:

Setup

library(tidyverse)

Some data

d <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/speed_gender_height.csv")

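Prepare the data

We will use a "total" histogram for the whole sample; to do that, we need to remove the grouping information from the data.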

d2 <-
  d |> 
  select(-gender)

Here is a dataset with the summary statistics:

d_summary <-
  d %>% 
  group_by(gender) %>% 
  summarise(height_m = mean(height, na.rm = TRUE),
            height_sd = sd(height, na.rm = TRUE))

d_summary

Plot it

d %>% 
  ggplot() +
  aes() +
  geom_histogram(aes(y = ..density.., x = height, fill = gender)) +
  facet_wrap(~ gender) +
  geom_histogram(data = d2, aes(y = ..density.., x = height), 
                 alpha = .5) +
  stat_function(data = d_summary %>% filter(gender == "female"),
                fun = dnorm,
                #color = "red",
                args = list(mean = filter(d_summary, 
                                          gender == "female")$height_m,
                            sd = filter(d_summary, 
                                        gender == "female")$height_sd)) +
  stat_function(data = d_summary %>% filter(gender == "male"),
                fun = dnorm,
                #color = "red",
                args = list(mean = filter(d_summary, 
                                          gender == "male")$height_m,
                            sd = filter(d_summary, 
                                        gender == "male")$height_sd)) +
  theme(legend.position = "none",
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(title = "Facetted histograms with overlaid normal curves",
       caption = "The grey histograms show the whole distribution (over both groups, i.e. females and males)") +
  scale_fill_brewer(type = "qual", palette = "Set1")
   

answered 2021-06-23T07:59:20.500