r - ggplot2: histogram with normal curve

Question

I've been trying to superimpose a normal curve over my histogram with ggplot2.

My code:

data <- read.csv (path...)

ggplot(data, aes(V2)) + 
  geom_histogram(alpha=0.3, fill='white', colour='black', binwidth=.04)

I tried several things:

+ stat_function(fun=dnorm)

...changed nothing

+ stat_density(geom = "line", colour = "red")

...gave me a straight red line on the x-axis.

+ geom_density()

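doesn't work for me, because I want to keep the frequency values on the y-axis and don't want density values.

Any suggestions?

Thanks in advance for any tips!

Solution found!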

+geom_density(aes(y=0.045*..count..), colour="black", adjust=4)
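For anyone reusing this: the hard-coded 0.045 appears to play the role of the bin width (0.04 in the histogram above), because the `count` variable computed by the density stat is the density multiplied by the number of points. Below is a minimal sketch of the same idea without a magic number. It assumes ggplot2 3.3.0 or later for `after_stat()`; the data frame `df`, its simulated values and the name `bw` are stand-ins, not the asker's data.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the asker's column V2
df <- data.frame(V2 = rnorm(500, mean = 1, sd = 0.2))
bw <- 0.04  # same bin width as the histogram in the question

p <- ggplot(df, aes(V2)) +
  geom_histogram(binwidth = bw, fill = "white", colour = "black") +
  # count = density * n, so count * bw is on the same scale as the bar heights
  geom_density(aes(y = after_stat(count) * bw), colour = "red")
p
```

Note that this draws a kernel density estimate, not a fitted normal curve; a large `adjust` only smooths it.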

score 38 · Accepted Answer

Think I got it:

set.seed(1)
df <- data.frame(PF = 10*rnorm(1000))
ggplot(df, aes(x = PF)) + 
    geom_histogram(aes(y =..density..),
                   breaks = seq(-50, 50, by = 10), 
                   colour = "black", 
                   fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(df$PF), sd = sd(df$PF)))

score 33 · Accepted Answer

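This has been answered here and partially here.

The area under a density curve is equal to 1, and the area under the histogram is equal to the width of the bars times the sum of their heights, i.e. the binwidth times the total number of non-missing observations. In order to fit both on the same graph, one or the other needs to be rescaled so that their areas match.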
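The area claim can be checked quickly in base R. This is only a sketch: the sample `x`, the bin width `bw` and the break construction are made up for illustration.

```r
set.seed(1)
x  <- rnorm(500, mean = 20, sd = 5)  # hypothetical sample, no missing values
bw <- 2                              # hypothetical bin width

# Equal-width breaks that cover the whole range of x
breaks <- seq(floor(min(x) / bw) * bw, ceiling(max(x) / bw) * bw, by = bw)
h <- hist(x, breaks = breaks, plot = FALSE)

area_counts  <- sum(h$counts * bw)   # area of the count histogram
area_density <- sum(h$density * bw)  # area of the density histogram

area_counts   # equals bw * length(x)
area_density  # equals 1
```

So multiplying a density curve by `bw * n` puts it on the count scale, and dividing counts by `bw * n` puts them on the density scale.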

If you want the y-axis to show frequency counts, there are a number of options:

First, simulate some data.

library(ggplot2)

set.seed(1)
dat_hist <- data.frame(
  group = c(rep("A", 200), rep("B",150)),
  value = c(rnorm(200, 20, 5), rnorm(150,25,10)))

# Set desired binwidth and number of non-missing obs
bw = 2
n_obs = sum(!is.na(dat_hist$value))

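Option 1: Plot both the histogram and the density curve as density, then rescale the y-axis

This is perhaps the easiest approach for a single histogram. Using the approach suggested by Carlos, plot both the histogram and the density curve as density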

g <- ggplot(dat_hist, aes(value))  + 
geom_histogram(aes(y = ..density..), binwidth = bw, colour = "black") + 
stat_function(fun = dnorm, args = list(mean = mean(dat_hist$value), sd = sd(dat_hist$value)))

Then rescale the y-axis.

ybreaks = seq(0,50,5) 
## On primary axis
g + scale_y_continuous("Counts", breaks = round(ybreaks / (bw * n_obs),3), labels = ybreaks)

## Or on secondary axis
g + scale_y_continuous("Density", sec.axis = sec_axis(
  trans = ~ . * bw * n_obs, name = "Counts", breaks = ybreaks))
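The axis arithmetic in both variants is the same conversion, just in opposite directions. A small sketch, assuming `bw = 2` and `n_obs = 350` as in the simulated data above; the helper names `count_to_density` and `density_to_count` are made up for illustration:

```r
bw    <- 2    # bin width used for the histogram
n_obs <- 350  # non-missing observations in the simulated data (200 + 150)

# Hypothetical helpers for moving between the two y scales
count_to_density <- function(count)   count / (bw * n_obs)
density_to_count <- function(density) density * bw * n_obs

ybreaks <- seq(0, 50, 5)
count_to_density(ybreaks)                    # where count labels sit on the density axis
density_to_count(count_to_density(ybreaks))  # round trip gives ybreaks back
```

One thing to be aware of: rounding the break positions to 3 decimals, as in the primary-axis version, nudges the labels slightly. For example 5 / 700 rounds to 0.007, which corresponds to a count of 4.9 rather than 5.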

Option 2: Rescale the density curve using stat_function

Code tidied up as per PatrickT's answer.

ggplot(dat_hist, aes(value))  + 
  geom_histogram(colour = "black", binwidth = bw) + 
  stat_function(fun = function(x) 
    dnorm(x, mean = mean(dat_hist$value), sd = sd(dat_hist$value)) * bw * n_obs)
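If you use this in more than one plot, the scaled curve can be wrapped in a small function factory so the mean, sd and n are computed once. This is just a sketch; `make_scaled_dnorm` is a made-up helper name, not part of ggplot2, and the data are re-simulated here so the block runs on its own.

```r
library(ggplot2)

# Hypothetical helper: returns a function of x giving the normal density
# fitted to `values`, scaled to the count axis of a histogram with `binwidth`
make_scaled_dnorm <- function(values, binwidth) {
  values <- values[!is.na(values)]
  m <- mean(values)
  s <- sd(values)
  n <- length(values)
  function(x) dnorm(x, mean = m, sd = s) * n * binwidth
}

set.seed(1)
dat <- data.frame(value = c(rnorm(200, 20, 5), rnorm(150, 25, 10)))
bw  <- 2

p <- ggplot(dat, aes(value)) +
  geom_histogram(binwidth = bw, colour = "black") +
  stat_function(fun = make_scaled_dnorm(dat$value, bw))
p
```

Dropping the `NA`s inside the helper keeps the scaling factor equal to the number of observations the histogram actually uses.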

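Option 3: Create an external dataset and plot using geom_line.

Unlike the options above, this one works with facets. (Edited to provide a dplyr- rather than plyr-based solution.) Note that the summarised dataset is used as the primary dataset, and the raw data is passed in for the histogram only.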

library(tidyverse)

dat_hist %>% 
  group_by(group) %>% 
  nest(data = c(value)) %>% 
  mutate(y = map(data, ~ dnorm(
    .$value, mean = mean(.$value), sd = sd(.$value)
    ) * bw * sum(!is.na(.$value)))) %>% 
  unnest(c(data,y)) %>% 
  
  ggplot(aes(x = value)) +
  geom_histogram(data = dat_hist, binwidth = bw, colour = "black") +
  geom_line(aes(y = y)) + 
  facet_wrap(~ group)
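The reason this needs per-group data is that each facet has its own mean, sd and scaling factor (`bw` times the group size). A base R sketch of those per-group numbers, using the same simulated data; the name `group_stats` is made up for illustration:

```r
set.seed(1)
dat <- data.frame(
  group = c(rep("A", 200), rep("B", 150)),
  value = c(rnorm(200, 20, 5), rnorm(150, 25, 10))
)
bw <- 2

# One row per group: sample size, fitted normal parameters, count-scale factor
group_stats <- do.call(rbind, lapply(split(dat, dat$group), function(d) {
  v <- d$value[!is.na(d$value)]
  data.frame(group = d$group[1], n = length(v),
             mean = mean(v), sd = sd(v), scale = length(v) * bw)
}))
group_stats
```

With 200 and 150 observations the scaling factors are 400 and 300, so a single curve scaled by the overall `bw * n_obs` would be too tall in both facets.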

Option 4: Create external functions to edit the data on the fly

Perhaps a bit over the top, but it might be useful for someone?

## Function to create scaled dnorm data along full x axis range
dnorm_scaled <- function(data, x = NULL, binwidth = 1, xlim = NULL) {
  .x <- na.omit(data[,x])
  if(is.null(xlim))
    xlim = c(min(.x), max(.x))
  x_range = seq(xlim[1], xlim[2], length.out = 101)
  setNames(
    data.frame(
    x = x_range,
    y = dnorm(x_range, mean = mean(.x), sd = sd(.x)) * length(.x) * binwidth),
    c(x, "y"))
}

## Function to apply over groups
dnorm_scaled_group <- function(data, x = NULL, group = NULL, binwidth = NULL, xlim = NULL) {
  dat_hists <- lapply(
    split(data, data[, group]), dnorm_scaled,
      x = x, binwidth = binwidth, xlim = xlim)
  for(g in names(dat_hists))
    dat_hists[[g]][, "group"] <- g
  setNames(do.call(rbind, dat_hists), c(x, "y", group))
}

## Single histogram
ggplot(dat_hist, aes(value)) + 
  geom_histogram(binwidth = bw, colour = "black") + 
  geom_line(data = ~ dnorm_scaled(., "value", binwidth = bw), 
            aes(y = y)) 

## With a single faceting variable
ggplot(dat_hist, aes(value))  + 
  geom_histogram(binwidth = 2, colour = "black") + 
  geom_line(data = ~ dnorm_scaled_group(
    ., x = "value", group = "group", binwidth = 2, xlim = c(0,50)), 
    aes(y = y)) +
  facet_wrap(~ group)

score 15 · Accepted Answer

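This is an extended comment on JWilliman's answer. I found J's answer very useful. While playing around with it I discovered a way to simplify the code. I'm not saying it is a better way, but I thought I would mention it.

Note that JWilliman's answer puts the count on the y-axis and uses a "hack" to scale the corresponding normal density approximation (which otherwise would cover a total area of 1 and therefore have a much lower peak).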
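To put numbers on "much lower peak", here is a quick base R check using the parameter values from this answer (n = 1000, mean 0, sd 1, binwidth 0.3). The names `mu`, `sigma`, `peak_density` and `peak_counts` are just for illustration.

```r
n        <- 1000
binwidth <- 0.3
mu       <- 0
sigma    <- 1

# Height of the normal density at its mode: 1 / (sigma * sqrt(2 * pi)), about 0.4
peak_density <- dnorm(mu, mean = mu, sd = sigma)

# Same curve on the count scale of the histogram, about 120
peak_counts <- peak_density * n * binwidth

c(density = peak_density, counts = peak_counts)
```

Without the `n * binwidth` factor the curve would top out around 0.4 on an axis whose bars reach over 100, so it would look like a flat line along the x-axis.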

The main point of this comment: simpler syntax inside stat_function, by passing the needed parameters to the aesthetics function, e.g.

aes(x = x, mean = 0, sd = 1, binwidth = 0.3, n = 1000)

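This avoids having to pass args = to stat_function and is therefore more user-friendly. Okay, it's not very different, but hopefully someone will find it interesting.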

# parameters that will be passed to ``stat_function``
n = 1000
mean = 0
sd = 1
binwidth = 0.3 # passed to geom_histogram and stat_function
set.seed(1)
df <- data.frame(x = rnorm(n, mean, sd))

ggplot(df, aes(x = x, mean = mean, sd = sd, binwidth = binwidth, n = n)) +
    theme_bw() +
    geom_histogram(binwidth = binwidth, 
        colour = "white", fill = "cornflowerblue", size = 0.1) +
stat_function(fun = function(x) dnorm(x, mean = mean, sd = sd) * n * binwidth,
    color = "darkred", size = 1)

score 8 · Accepted Answer

This code should do it:

set.seed(1)
z <- rnorm(1000)

qplot(z, geom = "blank") + 
geom_histogram(aes(y = ..density..)) + 
stat_density(geom = "line", aes(colour = "bla")) + 
stat_function(fun = dnorm, aes(x = z, colour = "blabla")) + 
scale_colour_manual(name = "", values = c("red", "green"), 
                               breaks = c("bla", "blabla"), 
                               labels = c("kernel_est", "norm_curv")) + 
theme(legend.position = "bottom", legend.direction = "horizontal")

Note: I used qplot, but you can use the more versatile ggplot.
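For reference, a rough ggplot() equivalent might look like the sketch below. It assumes ggplot2 3.3.0 or later (for `after_stat()`); `bins = 30` and the fill colour are my own choices, and the legend keys are used directly as colour labels instead of the "bla"/"blabla" placeholders.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(z = rnorm(1000))

p <- ggplot(df, aes(z)) +
  # Histogram on the density scale so it matches both curves
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey80", colour = "black") +
  # Kernel density estimate of the data
  stat_density(geom = "line", aes(colour = "kernel_est")) +
  # Standard normal curve (dnorm defaults: mean = 0, sd = 1)
  stat_function(fun = dnorm, aes(colour = "norm_curv")) +
  scale_colour_manual(name = "",
                      values = c(kernel_est = "red", norm_curv = "green")) +
  theme(legend.position = "bottom", legend.direction = "horizontal")
p
```

The normal curve here uses the default mean 0 and sd 1, which only fits because the data were simulated that way; for real data pass `args = list(mean = ..., sd = ...)` computed from the sample.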

score 1 · Accepted Answer

Here's a tidyverse-informed version:

Setup

library(tidyverse)

Some data

d <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/speed_gender_height.csv")

Prepare data

We will use a "total" histogram for the whole sample; to that end, we need to remove the grouping information from the data.

d2 <-
  d |> 
  select(-gender)

Here's a dataset with summary data:

d_summary <-
  d %>% 
  group_by(gender) %>% 
  summarise(height_m = mean(height, na.rm = T),
            height_sd = sd(height, na.rm = T))

d_summary

Plot it

d %>% 
  ggplot() +
  aes() +
  geom_histogram(aes(y = ..density.., x = height, fill = gender)) +
  facet_wrap(~ gender) +
  geom_histogram(data = d2, aes(y = ..density.., x = height), 
                 alpha = .5) +
  stat_function(data = d_summary %>% filter(gender == "female"),
                fun = dnorm,
                #color = "red",
                args = list(mean = filter(d_summary, 
                                          gender == "female")$height_m,
                            sd = filter(d_summary, 
                                        gender == "female")$height_sd)) +
  stat_function(data = d_summary %>% filter(gender == "male"),
                fun = dnorm,
                #color = "red",
                args = list(mean = filter(d_summary, 
                                          gender == "male")$height_m,
                            sd = filter(d_summary, 
                                        gender == "male")$height_sd)) +
  theme(legend.position = "none",
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(title = "Facetted histograms with overlaid normal curves",
       caption = "The grey histograms show the whole distribution over both groups, i.e. females and males") +
  scale_fill_brewer(type = "qual", palette = "Set1")
