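I have vectors of different sizes, and I want to sample them evenly (for example, 10 samples per vector) in such a way that the samples are representative of each vector.
Suppose one of my vectors is
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
What would be 10 representative points of this vector?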
If you mean preserving the shape of the curve, you could try keeping the local minima and maxima:
library(dplyr)

df <- data.frame(y = y)
y2 <- df %>%
  # lag()/lead() are NA at the two endpoints, so filter() drops those rows
  mutate(loc_minima = lag(y) > y & lead(y) > y,
         loc_maxima = lag(y) < y & lead(y) < y) %>%
  filter(loc_minima | loc_maxima) %>%
  select(y)
However, this does not guarantee that you will get exactly 10 points.
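To see how many points the local-extrema idea actually yields for the example vector, here is a minimal base R sketch (no dplyr). The variable names `slope_sign` and `turning` are illustrative only, and the approach assumes there are no flat runs of equal neighbouring values.

```r
y <- c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)

# Sign of the slope between consecutive points: -1 falling, +1 rising
slope_sign <- sign(diff(y))

# A turning point is where the slope sign changes; +1L maps the position
# in the differenced series back to an index into y
turning <- which(diff(slope_sign) != 0) + 1L

turning      # indices of local minima and maxima
y[turning]   # the corresponding values
```

For this vector only five interior turning points are found, which illustrates why the approach cannot be relied on to return a fixed number of samples.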
Thanks to @minem, I got the answer. Perfect!
library(kmlShape)
Px=(1:length(y))
Py=y
par(mfrow=c(1,2))
plot(Px,Py,type="l",main="original points")
plot(DouglasPeuckerNbPoints(Px,Py,10),type="b",col=2,main="reduced points")
This uses the Ramer–Douglas–Peucker algorithm; the two plots show the original curve next to its 10-point reduction.
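For readers who want to see the idea behind the reduction without installing kmlShape, below is a rough base R sketch of a Douglas–Peucker-style selection that stops at a target number of points. It is an assumption-laden illustration, not the package's implementation: the helper names `perp_dist` and `rdp_n_points` are hypothetical, and the output is not claimed to match `DouglasPeuckerNbPoints` exactly.

```r
# Perpendicular distance from points (px, py) to the line through
# (x1, y1) and (x2, y2)
perp_dist <- function(px, py, x1, y1, x2, y2) {
  abs((y2 - y1) * px - (x2 - x1) * py + x2 * y1 - y2 * x1) /
    sqrt((y2 - y1)^2 + (x2 - x1)^2)
}

# Start from the two endpoints and repeatedly add the point that lies
# farthest from the current piecewise-linear approximation
rdp_n_points <- function(x, y, n = 10) {
  keep <- c(1L, length(x))
  while (length(keep) < n) {
    best_d <- -Inf
    best_i <- NA_integer_
    for (s in seq_len(length(keep) - 1)) {
      a <- keep[s]
      b <- keep[s + 1]
      if (b - a < 2) next              # no interior points in this segment
      inner <- (a + 1):(b - 1)
      d <- perp_dist(x[inner], y[inner], x[a], y[a], x[b], y[b])
      if (max(d) > best_d) {
        best_d <- max(d)
        best_i <- inner[which.max(d)]
      }
    }
    if (is.na(best_i)) break           # nothing left to add
    keep <- sort(c(keep, best_i))
  }
  list(x = x[keep], y = y[keep])
}

y <- c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
x <- seq_along(y)
reduced <- rdp_n_points(x, y, 10)
```

The endpoints are always kept, so the reduced curve spans the same x range as the original.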
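The best answer has already been given, but since I was working on this, I'll post my naive heuristic solution anyway:
Disclaimer:
This is certainly less efficient and more naive than the Ramer–Douglas–Peucker algorithm, but in this case it gives similar results...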
# Try to remove iteratively one element from the vector until we reach N elements only.
# At each iteration, the reduced vector is interpolated and completed again
# using a spline, then it's compared with the original one and the
# point leading to the smallest difference is selected for the removal.
heuristicDownSample <- function(x,y,n=10){
  idxReduced <- 1:length(x)
  # stop once only n points remain (the original hard-coded 10 here)
  while(length(idxReduced) > n){
    minDist <- NULL
    idxFinal <- NULL
    for(idxToRemove in 1:length(idxReduced)){
      newIdxs <- idxReduced[-idxToRemove]
      spf <- splinefun(x[newIdxs],y[newIdxs])
      full <- spf(x)
      dist <- sum((full-y)^2)
      if(is.null(minDist) || dist < minDist){
        minDist <- dist
        idxFinal <- newIdxs
      }
    }
    idxReduced <- idxFinal
  }
  return(list(x=x[idxReduced],y=y[idxReduced]))
}
Usage:
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
x <- 1:length(y)
reduced <- heuristicDownSample(x,y,10)
par(mfrow=c(1,2))
plot(x=x,y=y,type="b",main="original")
plot(x=reduced$x,y=reduced$y,type="b",main="reduced",col='red')
Apparently you are interested in systematic sampling. If so, maybe the following can help.
set.seed(1234)
n <- 10
step <- floor(length(y)/n)
first <- sample(step, 1)
z <- y[step*(seq_len(n) - 1) + first]
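Since the question mentions vectors of different sizes, the same steps can be wrapped in a small function. This is only a sketch: `systematic_sample` is a hypothetical name, and it assumes every vector has at least `n` elements.

```r
systematic_sample <- function(y, n = 10) {
  stopifnot(length(y) >= n)
  step  <- floor(length(y) / n)           # distance between sampled indices
  first <- sample.int(step, 1)            # random start within the first step
  idx   <- step * (seq_len(n) - 1) + first
  list(index = idx, value = y[idx])
}

y <- c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
set.seed(1234)
s <- systematic_sample(y, 10)
```

With 21 elements and `n = 10` the step is 2, so the largest index that can be drawn is 20 and the final element of the vector is never selected.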
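You can use cut to generate a factor indicating which quintile (or whatever binning you prefer) each value belongs to, and then sample from there. Note that the code below actually builds four equal-width bins between the minimum and the maximum rather than true quintiles: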
df <- data.frame(values = c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23))
cutpoints <- seq(min(df$values), max(df$values), length.out = 5)
> cutpoints
[1] -2.00 4.25 10.50 16.75 23.00
df$quintiles <- cut(df$values, cutpoints, include.lowest = TRUE)
> df
   values   quintiles
1     2.5   [-2,4.25]
2     1.0   [-2,4.25]
3     0.0   [-2,4.25]
4     1.2   [-2,4.25]
5     2.0   [-2,4.25]
6     3.0   [-2,4.25]
7     2.0   [-2,4.25]
8     1.0   [-2,4.25]
9     0.0   [-2,4.25]
10   -2.0   [-2,4.25]
11   -1.0   [-2,4.25]
12    0.5   [-2,4.25]
13    2.0   [-2,4.25]
14    3.0   [-2,4.25]
15    6.0 (4.25,10.5]
16    5.0 (4.25,10.5]
17    7.0 (4.25,10.5]
18    9.0 (4.25,10.5]
19   11.0 (10.5,16.8]
20   15.0 (10.5,16.8]
21   23.0   (16.8,23]
Now you can split the data by quintiles, compute the proportion of values in each group, and sample accordingly.
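# split the values (not the whole data frame), so that length(x) in
# resample() below is the number of values in each group
groups <- split(df$values, df$quintiles)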
probs <- prop.table(table(df$quintiles))
nsample <- as.vector(ceiling(probs*10))
> nsample
[1] 7 2 1 1
resample <- function(x, ...) x[sample.int(length(x), ...)]
mysamples <- mapply(function(x, y) resample(x = x, size = y), groups, nsample)
z <- unname(unlist(mysamples))
> z
[1] 2.0 1.0 0.0 1.0 3.0 0.5 3.0 5.0 9.0 11.0 23.0
Because of ceiling(), this can end up sampling 11 cases instead of 10.
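One possible way to get exactly 10 is to replace ceiling() with a largest-remainder allocation: take the floor of each group's share, then hand the leftover samples to the groups with the largest fractional parts. The sketch below is an assumption on my part rather than part of the answer above; the names `target`, `exact`, `alloc` and `top` are illustrative, and it does not check that a group has enough values to fill its allocation.

```r
values <- c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
cutpoints <- seq(min(values), max(values), length.out = 5)
bins <- cut(values, cutpoints, include.lowest = TRUE)

target <- 10
exact  <- as.vector(prop.table(table(bins))) * target  # ideal, fractional sizes
alloc  <- floor(exact)                                 # guaranteed part
leftover <- target - sum(alloc)

# Give the remaining samples to the bins with the largest fractional parts
if (leftover > 0) {
  top <- order(exact - alloc, decreasing = TRUE)[seq_len(leftover)]
  alloc[top] <- alloc[top] + 1
}

alloc        # per-bin sample sizes, summing to target
```

For the example vector this gives 7, 2, 1 and 0, so the total is 10, at the cost of never sampling from the sparsest bin.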