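There seem to be plenty of "peaks of density function" threads already, but I don't see one that specifically addresses this point. Sorry for the repeat if I missed it.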

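My problem: given a vector of 1000 values (sample attached), I would like to identify the peaks in the histogram or the density function of the data. From the image of the sample data below, I can see peaks in the histogram at ~0, 6200, and 8400. But I need to obtain the exact values of these peaks, preferably with a simple procedure, since I have several thousand of these vectors to process.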

[Figure: histogram and density function of the sample data]

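I originally started working with the histogram output myself, but couldn't get any peak-finding command to work properly (like, at all). I'm not even sure how to get the peaks() command from the splus2R package to work on a histogram object or a density object. Working from the histogram would still be my preference, since I would like to identify the exact data value at the maximum frequency of each peak (as opposed to the density function value, which is slightly different), but I can't figure that out either.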

I would post the sample data myself, but I don't see a way to do that here (sorry if I'm just missing it).

4 Answers

If your y values are smooth (as in your sample plot), this should find the peaks pretty repeatably:

peakx <- x[which(diff(sign(diff(y)))==-2)]
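To make the index bookkeeping explicit, here is a minimal sketch on synthetic data (the vectors `x` and `y` below are made up for illustration and are not the asker's data). Because each `diff()` shortens the vector by one element, this sketch adds 1 to the index so that the result points at the peak itself rather than at the sample just before it.

```r
# Synthetic smooth curve with two known maxima (illustrative data only)
x <- seq(0, 10, by = 0.01)
y <- dnorm(x, mean = 3, sd = 0.5) + 0.6 * dnorm(x, mean = 7, sd = 0.8)

# diff(y) is the slope between neighbours; sign() gives +1 rising, -1 falling.
# A jump from +1 to -1 (a second difference of -2) marks a local maximum.
turn <- which(diff(sign(diff(y))) == -2)

# Each diff() drops one element, so shift by one to index the peak itself
peak_idx <- turn + 1
peakx <- x[peak_idx]
peaky <- y[peak_idx]
print(peakx)
```

On noisy `y` values the same test would flag every small wiggle, so it is best applied to a smoothed curve such as a kernel density estimate.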
answered 2012-10-30T06:08:54.723

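As already noted in the comments, finding peaks in a density function is related to finding local maxima and minima, where you can find more solutions. The answer from chthonicdaemon comes close to the peaks, but each diff reduces the vector length by one, so its result points one position before the actual peak.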

#Create Dataset
x <- c(1,1,4,4,9)

#Estimate Density
d <- density(x)

#Two ways to get highest Peak
d$x[d$y==max(d$y)]  #Gives you all highest Peaks
d$x[which.max(d$y)] #Gives you the first highest Peak

#3 ways to get all Peaks
#The logical index is padded with FALSE at both ends to match length(d$x)
d$x[c(FALSE, diff(diff(d$y)>=0)<0, FALSE)] #This also detects a plateau
d$x[c(FALSE, diff(sign(diff(d$y)))<0, FALSE)]
d$x[which(diff(sign(diff(d$y)))<0)+1]

#In case you also want the height of the peaks
data.frame(d[c("x", "y")])[c(FALSE, diff(diff(d$y)>=0)<0, FALSE),]

#In case you need a higher "precision"
d <- density(x, n=1e4)
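
For data like those in the question, a kernel density estimate can show small spurious bumps in addition to the main modes. The sketch below is one possible way to wrap the approach above in a function and keep only peaks that reach a fraction of the highest one. The function name `density_peaks`, the `min_rel_height` threshold, the `adjust = 0.5` setting, and the simulated sample are all illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper: local maxima of a kernel density estimate
density_peaks <- function(v, min_rel_height = 0.1, n = 2048, ...) {
  d <- density(v, n = n, ...)
  # TRUE where the slope changes from rising to falling; padded to length(d$y)
  is_peak <- c(FALSE, diff(sign(diff(d$y))) < 0, FALSE)
  # Drop minor bumps lower than a fraction of the highest peak
  keep <- is_peak & d$y >= min_rel_height * max(d$y)
  data.frame(x = d$x[keep], y = d$y[keep])
}

# Simulated sample with modes near 0, 6200 and 8400 (made-up data)
set.seed(1)
v <- c(rnorm(400, 0, 300), rnorm(300, 6200, 300), rnorm(300, 8400, 300))

# adjust = 0.5 halves the default bandwidth so nearby modes stay separate
pk <- density_peaks(v, adjust = 0.5)
print(pk)
```

The bandwidth controls how many peaks survive: a larger bandwidth merges neighbouring modes, a smaller one lets sampling noise through, so it may need tuning per data set. For several thousand vectors, the helper could be applied with `lapply()` over a list of vectors.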
answered 2019-03-25T14:10:02.673

Since you are thinking about histograms, maybe you should use the histogram output directly?

data <- c(rnorm(100,mean=20),rnorm(100,mean=12))

peakfinder <- function(d){
  dh <- hist(d,plot=FALSE)
  ins <- dh[["density"]] ## the older "intensities" component is not returned by current R versions
  nbins <- length(ins)
  ss <- which(rank(ins)%in%seq(from=nbins-2,to=nbins)) ## pick the top 3 intensities
  dh[["mids"]][ss]
}

peaks <- peakfinder(data)

hist(data)
abline(v=peaks,col="red") ## abline() accepts a vector of positions

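This isn't perfect: for example, it will just find the top bins, even if they are adjacent. Maybe you could define "peak" more precisely? Hope that helps.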
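
One possible way around the adjacent-bins caveat is to pick bins from tallest to shortest and skip any bin that lies too close to one already chosen. The sketch below illustrates that idea; the function name `peakfinder_spaced` and its `k` and `min_gap` parameters are hypothetical, and the minimum gap is measured in bins, not in data units.

```r
# Hypothetical variant: pick the k tallest bins, skipping bins too close
# to one that was already selected
peakfinder_spaced <- function(d, k = 3, min_gap = 2, breaks = "Sturges") {
  dh <- hist(d, breaks = breaks, plot = FALSE)
  counts <- dh$counts
  chosen <- integer(0)
  # Visit bins from tallest to shortest
  for (i in order(counts, decreasing = TRUE)) {
    if (length(chosen) >= k) break
    # Accept the bin only if it is at least min_gap bins from every chosen one
    if (all(abs(i - chosen) >= min_gap)) chosen <- c(chosen, i)
  }
  sort(dh$mids[chosen])
}

set.seed(42)
data <- c(rnorm(100, mean = 20), rnorm(100, mean = 12))
peaks <- peakfinder_spaced(data, k = 2, min_gap = 3)
print(peaks)
```

This still requires knowing roughly how many peaks to expect, and the returned values are bin midpoints, so their resolution is limited by the bin width.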

answered 2012-10-30T06:46:50.350

More than 8 years on, this is still a valid and classic question. Here is a complete answer, building on the excellent clue provided by @chthonicdaemon.

library(ggplot2)
library(data.table)
library(magrittr) ### provides the %>% pipe used below
### I use a preloaded data.table. You can use any data.table with one numeric column x.
### Extract counts & breaks of the histogram bins. 
### I have taken breaks as 40 but you can take any number as needed.
### But do keep a large number of breaks so that you get multiple peaks.
h <- hist(dt1$x, breaks = 40, plot = FALSE) ### compute the histogram once, without drawing it
counts <- h$counts
breaks <- h$breaks
### Note: the data.table `dt1` should contain at least one numeric column, x

### now name the counts vector with the corresponding breaks 
### note: the length of counts is 1 less than the breaks
names(counts) <- breaks[-length(breaks)]

### Find index for those counts that are the peaks 
### (see previous classic clue to take a double diff)
### note: the double diff causes the 2 count shrink, hence
#### I have added a FALSE before and after the results 
### to align the T/F vector with the count vector

peak_indx <- c(F,diff(sign(c(diff(counts))))==-2,F) %>% which()
topcounts <- counts[peak_indx]
topbreaks <- names(topcounts) %>% as.numeric()

### Now let's use ggplot to plot the histogram along with visualised peaks

dt1 %>%     
ggplot() + 
geom_histogram(aes(x),bins = 40,col="grey51",na.rm = T) + 
geom_vline(xintercept = topbreaks + 50,lty = 2) 
# adjust the value 50 to bring the lines in the centre
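
Rather than shifting the lines by a hand-tuned constant, the bin midpoints returned by `hist()` can be used directly. The sketch below shows this in base R on simulated data; the `dt1` defined here is a made-up stand-in for the answer's preloaded table, and the variable names `is_peak`, `peak_mids`, and `peak_counts` are illustrative.

```r
# Simulated stand-in for the answer's data (illustrative only)
set.seed(7)
dt1 <- data.frame(x = c(rnorm(500, 0, 200), rnorm(500, 6200, 200)))

# Compute the histogram once, without drawing it
h <- hist(dt1$x, breaks = 40, plot = FALSE)

# Local maxima of the bin counts, padded so the index lines up with h$counts
is_peak <- c(FALSE, diff(sign(diff(h$counts))) == -2, FALSE)

# Bin midpoints sit at the centre of each peak bin, so no manual offset is needed
peak_mids <- h$mids[is_peak]
peak_counts <- h$counts[is_peak]
print(data.frame(mid = peak_mids, count = peak_counts))
```

The midpoints could then be passed to `geom_vline(xintercept = peak_mids, lty = 2)` in place of `topbreaks + 50`, provided the ggplot histogram uses the same breaks as `hist()`. Note that the strict `== -2` test skips peaks whose top spans two bins of equal height.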

[Figure: output histogram with the peaks marked]

answered 2021-11-04T09:50:34.917