r - Calculating peaks in histograms or density functions

Question

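There seem to be a lot of "peaks in density function" threads already, but I don't see one that addresses this point specifically. Sorry for the repeat if I missed it.

My question: given a vector of 1000 values (sample attached), I want to identify the peaks in the histogram or density function of the data. From the image of the sample data below, I can see peaks in the histogram at ~0, 6200, and 8400. But I need the exact values of these peaks, preferably through a simple procedure, since I have several thousand of these vectors to process.

[Image: histogram and density function]

I initially started working with the histogram output myself, but couldn't get any peak-finding command to work properly (as in, at all). I'm not even sure how to get the peaks() command from the splus2R package to work on a histogram object or a density object. This would still be my preference, since I want to identify the exact data value with the maximum frequency for each peak (as opposed to the density function value, which is slightly different), but I can't figure that out either.

I would post the sample data itself, but I can't see a way to do that here (sorry if I just missed it).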

score 7 · Accepted Answer

If your y values are smooth (as in your sample plot), this should find the peaks pretty repeatably:

peakx <- x[which(diff(sign(diff(y)))==-2)]
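To see why smoothness matters, here is a minimal sketch on a synthetic curve. The two-bump curve, the noise level, and the helper name `find_peaks` are assumptions for illustration, not the asker's data. The helper adds 1 to the result of `which()` so that the index lines up with `y`.

```r
# Synthetic smooth curve with two bumps (stands in for real data)
x <- seq(-5, 10, by = 0.01)
y <- dnorm(x, mean = 0) + dnorm(x, mean = 5)

# Positions where the slope changes from rising to falling.
# which() points one step before the maximum, hence the + 1.
find_peaks <- function(y) which(diff(sign(diff(y))) == -2) + 1

peakx_smooth <- x[find_peaks(y)]

# Same curve with a little noise: almost every wiggle now counts as a "peak"
set.seed(1)
y_noisy <- y + rnorm(length(y), sd = 0.01)
peakx_noisy <- x[find_peaks(y_noisy)]

length(peakx_smooth)  # number of peaks on the smooth curve
length(peakx_noisy)   # far more candidates on the noisy curve
```

On noisy input, smoothing first (for example with `density()` or a wider bandwidth) is probably needed before this test gives useful results.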

score 4 · Accepted Answer

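As already mentioned in the comments, finding peaks in a density function is related to finding local maxima and minima, where you can find more solutions. The answer by chthonicdaemon comes close to the peaks, but each diff reduces the vector length by one.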
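A tiny worked example of that index shift, using a made-up vector (the values and the names `raw` and `fixed` are only for illustration):

```r
y <- c(1, 3, 2, 5, 4)      # local maxima sit at positions 2 and 4

d1 <- diff(y)              # length 4:  2 -1  3 -1
d2 <- diff(sign(d1))       # length 3: -2  2 -2

raw   <- which(d2 == -2)   # 1 3  (one position too early)
fixed <- raw + 1           # 2 4  (aligned with y)

y[raw]    # 1 2
y[fixed]  # 3 5
```

Adding 1, or equivalently padding the logical vector with a leading FALSE, brings the result back in line with the original vector.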

#Create Dataset
x <- c(1,1,4,4,9)

#Estimate Density
d <- density(x)

#Two ways to get highest Peak
d$x[d$y==max(d$y)]  #Gives you all highest Peaks
d$x[which.max(d$y)] #Gives you the first highest Peak

#3 ways to get all Peaks
d$x[c(FALSE, diff(diff(d$y)>=0)<0, FALSE)] #This detects also a plateau
d$x[c(FALSE, diff(sign(diff(d$y)))<0, FALSE)]
d$x[which(diff(sign(diff(d$y)))<0)+1]

#In case you also want the height of the peaks
data.frame(d[c("x", "y")])[c(FALSE, diff(diff(d$y)>=0)<0, FALSE),]

#In case you need a higher "precision"
d <- density(x, n=1e4)
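The variants differ on flat stretches. Here is a small sketch with two made-up vectors, one with a flat top and one with a flat shoulder on the way up. The helper names are hypothetical, and the trailing FALSE just pads the logical vector to full length.

```r
peaks_ge   <- function(y) which(c(FALSE, diff(diff(y) >= 0) < 0, FALSE))
peaks_sign <- function(y) which(c(FALSE, diff(sign(diff(y))) < 0, FALSE))
peaks_eq2  <- function(y) which(c(FALSE, diff(sign(diff(y))) == -2, FALSE))

plateau  <- c(1, 2, 2, 1)  # flat top at positions 2 and 3
shoulder <- c(1, 2, 2, 3)  # flat step, but no maximum in the interior

peaks_ge(plateau)     # 3      (last point of the plateau)
peaks_sign(plateau)   # 2 3    (both plateau points)
peaks_eq2(plateau)    # integer(0), the == -2 test misses plateaus

peaks_ge(shoulder)    # integer(0)
peaks_sign(shoulder)  # 2      (the shoulder is reported as a peak)
```

So on these inputs the `>= 0` form looks like the safer choice when exact ties between neighbouring values are possible. On a `density()` result such ties should be rare.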

score 3 · Accepted Answer

Since you are thinking about histograms, maybe you should use the histogram output directly?

data <- c(rnorm(100,mean=20),rnorm(100,mean=12))

peakfinder <- function(d){
  dh <- hist(d,plot=FALSE)
  ins <- dh[["density"]] ## same values as the older "intensities" component, which recent R versions no longer return
  nbins <- length(ins)
  ss <- which(rank(ins)%in%seq(from=nbins-2,to=nbins)) ## pick the top 3 intensities
  dh[["mids"]][ss]
}

peaks <- peakfinder(data)

hist(data)
sapply(peaks,function(x) abline(v=x,col="red"))

This isn't perfect: for example, it will just find the top bins, even if they are adjacent. Maybe you could define "peak" more precisely? Hope that helps.
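One possible tightening of the definition, sketched under assumptions: treat a bin as a peak only if it is higher than its left neighbour and at least as high as its right neighbour, then keep the tallest `k` of those. The function name `peakfinder_local` and the parameter `k` are made up for this example.

```r
set.seed(1)
data <- c(rnorm(100, mean = 20), rnorm(100, mean = 12))

peakfinder_local <- function(d, k = 3) {
  dh   <- hist(d, plot = FALSE)
  dens <- dh$density
  # Compare each bin with its neighbours; the ends are padded with -Inf
  left  <- c(-Inf, dens[-length(dens)])
  right <- c(dens[-1], -Inf)
  idx <- which(dens > left & dens >= right)
  # Keep the k tallest local maxima, returned in left-to-right order
  idx <- head(idx[order(dens[idx], decreasing = TRUE)], k)
  dh$mids[sort(idx)]
}

peaks <- peakfinder_local(data)
peaks
```

Because the left comparison is strict, two adjacent bins can never both qualify, which avoids the adjacent-bins problem above. Small bumps in the tails can still show up, so a minimum-height filter may be worth adding.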

score 1 · Accepted Answer

More than 8 years later, this is still a valid and classic question. Here is a complete answer, building on the excellent clue provided by @chthonicdaemon.

library(ggplot2)
library(data.table)
library(magrittr) # provides the %>% pipe used below
### I use a preloaded data.table. You can use any data.table with one numeric column x.
### Extract counts & breaks of the histogram bins. 
### I have taken breaks as 40 but you can take any number as needed.
### But do keep a large number of breaks so that you get multiple peaks.
h <- hist(dt1$x, breaks = 40, plot = FALSE) # compute the histogram once, without drawing it
counts <- h$counts
breaks <- h$breaks
### Note: the data.table `dt1` should contain at least one numeric column, x

### now name the counts vector with the corresponding breaks 
### note: the length of counts is 1 less than the breaks
names(counts) <- breaks[-length(breaks)]

### Find index for those counts that are the peaks 
### (see previous classic clue to take a double diff)
### note: the double diff causes the 2 count shrink, hence
#### I have added a FALSE before and after the results 
### to align the T/F vector with the count vector

peak_indx <- c(F,diff(sign(c(diff(counts))))==-2,F) %>% which()
topcounts <- counts[peak_indx]
topbreaks <- names(topcounts) %>% as.numeric()

### Now let's use ggplot to plot the histogram along with visualised peaks

dt1 %>%     
ggplot() + 
geom_histogram(aes(x),bins = 40,col="grey51",na.rm = T) + 
geom_vline(xintercept = topbreaks + 50,lty = 2) 
# adjust the value 50 to bring the lines in the centre
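As an alternative to shifting the lines by a hand-tuned offset, the bin midpoints returned by `hist()` can be used directly. This is a base-R sketch on simulated data (the mixture below is an assumption standing in for `dt1$x`):

```r
set.seed(42)
x <- c(rnorm(500, 0, 300), rnorm(300, 6200, 300), rnorm(200, 8400, 300))

h <- hist(x, breaks = 40, plot = FALSE)

# Same double-diff test as above, padded so it aligns with h$counts
peak_indx <- which(c(FALSE, diff(sign(diff(h$counts))) == -2, FALSE))

peak_mids   <- h$mids[peak_indx]    # centre of each peak bin
peak_counts <- h$counts[peak_indx]  # height of each peak bin

data.frame(mid = peak_mids, count = peak_counts)
```

With this, `geom_vline(xintercept = peak_mids, lty = 2)` should put the lines at the bin centres, provided the ggplot bins match the `hist()` breaks.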
