r - R: ecdf over histogram

Question

In R, with ecdf I can plot an empirical cumulative distribution function

plot(ecdf(mydata))

and with hist I can plot a histogram of my data

hist(mydata)

How can I plot the histogram and the ecdf in the same plot?

EDIT

I am trying to make something like this

https://mathematica.stackexchange.com/questions/18723/how-do-i-overlay-a-histogram-with-a-plot-of-cdf

score 9 · Accepted Answer

Also a bit late, here is another solution that extends @Christoph's solution with a second y-axis.

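par(mar = c(5, 5, 2, 5))  # widen the right margin to make room for the second axis
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
  dt,
  breaks = seq(0, 100, 1),
  xlim = c(0, 100))

par(new = TRUE)  # the next plot() is drawn on top of the histogram

ec <- ecdf(dt)
# invisible scaffold plot; xlim/ylim are set explicitly so it shares the histogram's coordinates
plot(x = h$mids, y = ec(h$mids) * max(h$counts), col = rgb(0, 0, 0, alpha = 0),
     xlim = c(0, 100), ylim = c(0, max(h$counts)), axes = FALSE, xlab = NA, ylab = NA)
lines(x = h$mids, y = ec(h$mids) * max(h$counts), col = 'red')
# right-hand axis: tick positions are on the count scale, labels run from 0 to 1
axis(4, at = seq(from = 0, to = max(h$counts), length.out = 11), labels = seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')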

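The trick is the following: you don't add a line to your plot, but draw another plot on top of it, which is why we need par(new = TRUE). Then you have to add the y-axis afterwards (otherwise it would be drawn over the y-axis on the left).

Credits go here (@tim_yates's answer) and there.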

score 4 · Accepted Answer

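There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram. This results in a histogram that is harder to read. The second way is to alter the scale of one or the other element.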

I suspect this question will soon become interesting to you, particularly @hadley's answer.
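To make the first approach concrete, here is a minimal base R sketch that draws the bars as relative frequencies so they share the 0 to 1 scale with the ECDF. The variable names (x, rel_freq) and the colours are illustrative choices, not part of the original answer, and the data are simulated the same way as in the ggplot2 example.

```r
set.seed(27272)
x <- rexp(333, rate = 4) + rnorm(333)

h <- hist(x, plot = FALSE)             # bin the data without drawing anything
rel_freq <- h$counts / sum(h$counts)   # bar heights as proportions (they sum to 1)

n <- length(h$breaks)
# empty canvas spanning the bins horizontally and 0..1 vertically
plot(range(h$breaks), c(0, 1), type = "n",
     xlab = "x", ylab = "Proportion", main = "Relative frequency and ECDF")
# one rectangle per bin, from its left break to its right break
rect(h$breaks[-n], 0, h$breaks[-1], rel_freq, col = "grey80", border = "white")
# ECDF on the same 0..1 scale
lines(ecdf(x), col = "red", do.points = FALSE)
```

Because no single bin holds more than a small share of the data, the bars stay short next to the ECDF, which is the readability problem mentioned above.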

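ggplot2 single scale

Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome, though, because the CDF and the histogram (count or relative) are on quite different visual scales. Note that this solution has the data in a data frame called mydata, with the desired variable in x.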

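library(ggplot2)
set.seed(27272)
mydata <- data.frame(x = rexp(333, rate = 4) + rnorm(333))

# geom_histogram() does the binning (current geom_bar() counts unique x values instead);
# after_stat() needs ggplot2 >= 3.3.0 and turns the bin counts into relative frequencies
ggplot(mydata, aes(x)) +
    stat_ecdf(color = "red") +
    geom_histogram(aes(y = after_stat(count / sum(count))))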


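base R multi scale

Here I will rescale the empirical CDF so that instead of a maximum value of 1, its maximum is the height of the bin with the highest relative frequency.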

h  <- hist(mydata$x, freq = FALSE)
ec <- ecdf(mydata$x)
# evaluate the ECDF at its knots and rescale it so that 1 maps to the tallest density bar
lines(x = knots(ec),
    y = ec(knots(ec)) * max(h$density),
    col = 'red')
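A rescaled ECDF is easier to read with its own axis. The following is a sketch, not part of the original answer, that adds a right-hand axis mapping the rescaled curve back to 0..1; the names scale_factor and probs are illustrative, and the data are simulated as in the ggplot2 example.

```r
set.seed(27272)
x <- rexp(333, rate = 4) + rnorm(333)

op <- par(mar = c(5, 4, 4, 5))        # extra room on the right for the second axis
h  <- hist(x, freq = FALSE)
ec <- ecdf(x)
scale_factor <- max(h$density)        # ECDF value 1 is drawn at the tallest bar

lines(knots(ec), ec(knots(ec)) * scale_factor, col = "red")

# tick positions live on the density scale, labels on the 0..1 ECDF scale
probs <- seq(0, 1, by = 0.2)
axis(4, at = probs * scale_factor, labels = probs, col = "red", col.axis = "red")
mtext("ECDF", side = 4, line = 3, col = "red")
par(op)                               # restore the previous margins
```

Since the curve is already expressed in the histogram's coordinates, no second plot is layered on top here; axis(4) only relabels the existing scale.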


score 3 · Accepted Answer

You can try a ggplot approach with a second axis

library(dplyr)    # %>%, bind_cols, mutate
library(tibble)   # tibble
library(ggplot2)

set.seed(15)
a <- rnorm(500, 50, 10)

# calculate ecdf with binsize 30
binsize=30
# note: max(a) is the largest data value, not the tallest bin count
df <- tibble(x=seq(min(a), max(a), diff(range(a))/binsize)) %>% 
        bind_cols(Ecdf=with(.,ecdf(a)(x))) %>% 
        mutate(Ecdf_scaled=Ecdf*max(a))
# plot
ggplot() + 
  geom_histogram(aes(a), bins = binsize) +
  geom_line(data = df, aes(x=x, y=Ecdf_scaled), color=2, size = 2) + 
  scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(a), name = "Ecdf"))

Edit

Since the scaling was wrong, I added a second solution that calculates everything in advance:

binsize=30
a_range= floor(range(a)) +c(0,1)

b <- seq(a_range[1], a_range[2], round(diff(a_range)/binsize)) %>% floor() 


df_hist <- tibble(a) %>% 
  mutate(gr = cut(a,b, labels = floor(b[-1]), include.lowest = T, right = T)) %>% 
  count(gr) %>% 
  mutate(gr = as.character(gr) %>% as.numeric()) 

# calculate ecdf with binsize 30
df <- tibble(x=b) %>% 
  bind_cols(Ecdf=with(.,ecdf(a)(x))) %>% 
  mutate(Ecdf_scaled=Ecdf*max(df_hist$n))
  
ggplot(df_hist, aes(gr, n)) + 
   geom_col(width = 2, color = "white") + 
   geom_line(data = df, aes(x=x, y=Ecdf*max(df_hist$n)), color=2, size = 2) +
   scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(df_hist$n), name = "Ecdf"))

score 2 · Accepted Answer

As already pointed out, this is problematic because the plots you want to merge have such different y-scales. You can try

set.seed(15)
mydata<-runif(50)
hist(mydata, freq=F)
lines(ecdf(mydata))

to get the ECDF drawn as a step function over the density-scaled histogram.

score 2 · Accepted Answer

Although a bit late... another version that works with preset bins:

set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
    dt,
    breaks = seq(0, 100, 1),
    xlim = c(0, 100))
ec <- ecdf(dt)
lines(x = h$mids, y = ec(h$mids) * max(h$counts), col = 'red')
lines(x = c(0, 100), y = c(1, 1) * max(h$counts), col = 'red', lty = 3) # indicates 100%
# bin midpoint (not the bin index) where the ECDF is closest to 90%
x90 <- h$mids[which.min(abs(ec(h$mids) - 0.9))]
lines(x = c(x90, x90), y = c(0, max(h$counts)), col = 'black', lty = 3) # indicates where 90% is reached

(Only the second y-axis is not working yet...)

score 0 · Accepted Answer

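In addition to the previous answers, I wanted ggplot to do the tedious calculation (in contrast to @Roman's solution, which was kindly updated at my request), i.e. calculate and draw the histogram and calculate and overlay the ECDF. I came up with the following (pseudo-code):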

# 1. Prepare the plot
plot <- ggplot() + geom_histogram(...)

# 2. Get the max value of Y axis as calculated in the previous step
maxPlotY <- max(ggplot_build(plot)$data[[1]]$y)

# 3. Overlay scaled ECDF and add secondary axis
plot +
  stat_ecdf(aes(y=..y..*maxPlotY)) +
  scale_y_continuous(name = "Density", sec.axis = sec_axis(trans = ~./maxPlotY, name = "ECDF"))

This way you don't need to calculate everything beforehand and feed the results to ggplot. Lie back and let it do everything for you!
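For readers who want something they can run directly, here is one possible sketch of the same three steps. It makes a few substitutions that are assumptions on my part rather than the answer's own code: the sample data a, reading the bar heights with layer_data(), and computing the ECDF with ecdf() plus geom_step() instead of stat_ecdf(), so the code does not depend on how stat_ecdf names its computed variable.

```r
library(ggplot2)

set.seed(15)
a <- rnorm(500, 50, 10)   # illustrative data, as in the earlier answers

# 1. Prepare the histogram; ggplot2 does the binning
p <- ggplot(data.frame(a = a), aes(a)) + geom_histogram(bins = 30)

# 2. Height of the tallest bar, read from the layer's computed data
max_count <- max(layer_data(p, 1)$count)

# 3. ECDF evaluated at the sorted data, rescaled to the count scale
ec <- data.frame(x = sort(a), y = ecdf(a)(sort(a)) * max_count)

p2 <- p +
  geom_step(data = ec, aes(x = x, y = y), colour = "red") +
  scale_y_continuous(name = "Count",
                     sec.axis = sec_axis(~ . / max_count, name = "ECDF"))
print(p2)
```

The secondary axis simply divides by max_count, the inverse of the rescaling applied to the ECDF in step 3.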
