r - How to subset a vector in a way that represents the general shape of the original vector in R

Question

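I have vectors of different sizes, and I want to sample them equally (for example, 10 samples from each vector) in such a way that the samples represent each vector.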

Suppose one of my vectors is

y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)

What would be 10 representative points of this vector?

score 1 · Accepted Answer

If you mean preserving the shape of the curve, you could try keeping the local minima and maxima:

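library(dplyr)  # provides %>%, mutate(), lag(), lead(), filter(), select()

df <- as.data.frame(y)
y2 <- df %>%
  # a point is a local minimum (maximum) if both neighbours are strictly larger (smaller)
  mutate(loc_minima = lag(y) > y & lead(y) > y,
         loc_maxima = lag(y) < y & lead(y) < y) %>%
  # the first and last rows compare against NA and are dropped by filter()
  filter(loc_minima | loc_maxima) %>%
  select(y)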

Note, however, that this does not guarantee you will end up with exactly 10 points.
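
To see how many points this approach keeps for the example vector, the same idea can be sketched in base R. This is only an illustration, not part of the answer above: the names `turning` and `keep` are made up here, keeping the two endpoints is an extra assumption (the dplyr version drops them), and the sketch assumes no two neighbouring values are equal.

```r
y <- c(2.5, 1, 0, 1.2, 2, 3, 2, 1, 0, -2, -1, .5, 2, 3, 6, 5, 7, 9, 11, 15, 23)

# sign(diff(y)) is +1 where the series rises and -1 where it falls;
# a change of that sign marks a local minimum or maximum.
turning <- which(diff(sign(diff(y))) != 0) + 1

# Assumption: also keep the first and last point so the ends of the curve survive.
keep <- c(1, turning, length(y))

y[turning]    # local extrema only
length(keep)  # number of points kept, endpoints included
```

For the example vector this keeps 5 extrema (7 points with the endpoints), so the count is driven by the data rather than by a target such as 10.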

score 1 · Accepted Answer

Thanks to @minem, I got the answer. Perfect!

library(kmlShape)

Px <- seq_along(y)
Py <- y
par(mfrow = c(1, 2))
plot(Px, Py, type = "l", main = "original points")
plot(DouglasPeuckerNbPoints(Px, Py, 10), type = "b", col = 2, main = "reduced points")

This is the result (using the Ramer–Douglas–Peucker algorithm): [figure: sample reduction]
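
For readers without kmlShape, the idea of reducing a curve to a fixed number of points can be sketched in base R. This is a greedy variant in the spirit of Ramer–Douglas–Peucker and is not claimed to match the package's implementation: it starts from the two endpoints and repeatedly adds the point lying farthest from the current simplified polyline. The function names `point_line_dist` and `rdp_n_points` are hypothetical, and `x` is assumed to be strictly increasing with at least two points.

```r
# Perpendicular distance from points (px, py) to the line through (x1, y1) and (x2, y2)
point_line_dist <- function(px, py, x1, y1, x2, y2) {
  abs((y2 - y1) * px - (x2 - x1) * py + x2 * y1 - y2 * x1) /
    sqrt((y2 - y1)^2 + (x2 - x1)^2)
}

# Keep the endpoints, then add the farthest remaining point until n points are kept
rdp_n_points <- function(x, y, n = 10) {
  keep <- c(1L, length(x))
  while (length(keep) < n && length(keep) < length(x)) {
    best_d <- -Inf
    best_i <- NA_integer_
    for (k in seq_len(length(keep) - 1)) {
      a <- keep[k]
      b <- keep[k + 1]
      if (b - a < 2) next                 # no interior points in this segment
      inner <- (a + 1):(b - 1)
      d <- point_line_dist(x[inner], y[inner], x[a], y[a], x[b], y[b])
      if (max(d) > best_d) {
        best_d <- max(d)
        best_i <- inner[which.max(d)]
      }
    }
    keep <- sort(c(keep, best_i))
  }
  list(x = x[keep], y = y[keep])
}

y <- c(2.5, 1, 0, 1.2, 2, 3, 2, 1, 0, -2, -1, .5, 2, 3, 6, 5, 7, 9, 11, 15, 23)
reduced <- rdp_n_points(seq_along(y), y, n = 10)
length(reduced$x)   # 10
```

Unlike the local-extrema approach, this always returns exactly `n` points as long as the input has at least `n`.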

score 1 · Accepted Answer

The best answer has already been given, but since I was working on it, I'll post my naive heuristic solution:

Disclaimer:
This is certainly less efficient and more naive than the Ramer-Douglas-Peucker algorithm, but in this case it gives a similar result...

# Try to remove iteratively one element from the vector until we reach N elements only.
# At each iteration, the reduced vector is interpolated and completed again
# using a spline, then it's compared with the original one and the
# point leading to the smallest difference is selected for the removal.
heuristicDownSample <- function(x,y,n=10){
  idxReduced <- 1:length(x)
  while(length(idxReduced) > n){
    minDist <- NULL
    idxFinal <- NULL
    for(idxToRemove in 1:length(idxReduced)){
      newIdxs <- idxReduced[-idxToRemove]
      spf <- splinefun(x[newIdxs],y[newIdxs])
      full <- spf(x)
      dist <- sum((full-y)^2)
      if(is.null(minDist) || dist < minDist){
        minDist <- dist
        idxFinal <- newIdxs
      }
    }
    idxReduced <- idxFinal
  }
  return(list(x=x[idxReduced],y=y[idxReduced]))
}

Usage:

y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
x <- 1:length(y)
reduced <- heuristicDownSample(x,y,10)

par(mfrow=c(1,2))
plot(x=x,y=y,type="b",main="original")
plot(x=reduced$x,y=reduced$y,type="b",main="reduced",col='red')

score 0 · Accepted Answer

Apparently you are interested in systematic sampling. If so, maybe the following can help.

set.seed(1234)
n <- 10
step <- floor(length(y)/n)
first <- sample(step, 1)
z <- y[step*(seq_len(n) - 1) + first]
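
To make the index formula above concrete, the sketch below lists the positions it selects for every possible random start. The names `idx_by_start` and `idx_even` are illustrative only, and the evenly spaced alternative at the end is an assumption added here, not part of the answer.

```r
y <- c(2.5, 1, 0, 1.2, 2, 3, 2, 1, 0, -2, -1, .5, 2, 3, 6, 5, 7, 9, 11, 15, 23)
n <- 10
step <- floor(length(y) / n)   # floor(21 / 10) = 2

# Indices selected for each possible value of `first` (1..step)
idx_by_start <- lapply(seq_len(step), function(first) step * (seq_len(n) - 1) + first)
idx_by_start[[1]]   # 1 3 5 ... 19
idx_by_start[[2]]   # 2 4 6 ... 20

# Assumed deterministic alternative: spread n indices evenly
# from the first to the last element, so both ends are always kept.
idx_even <- unique(round(seq(1, length(y), length.out = n)))
y[idx_even]
```

With this vector the random-start version never reaches index 21, so the final value (23) is not sampled. The evenly spaced variant keeps both endpoints.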

score 0 · Accepted Answer

You can use cut to generate a factor indicating which quintile (or whatever quantile you want) your values fall into, and then sample from there:

df <- data.frame(values = c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23))

cutpoints <- seq(min(df$values), max(df$values), length.out = 5)

> cutpoints
[1] -2.00  4.25 10.50 16.75 23.00

df$quintiles <- cut(df$values, cutpoints, include.lowest = TRUE)

> df
   values   quintiles
1     2.5   [-2,4.25]
2     1.0   [-2,4.25]
3     0.0   [-2,4.25]
4     1.2   [-2,4.25]
5     2.0   [-2,4.25]
6     3.0   [-2,4.25]
7     2.0   [-2,4.25]
8     1.0   [-2,4.25]
9     0.0   [-2,4.25]
10   -2.0   [-2,4.25]
11   -1.0   [-2,4.25]
12    0.5   [-2,4.25]
13    2.0   [-2,4.25]
14    3.0   [-2,4.25]
15    6.0 (4.25,10.5]
16    5.0 (4.25,10.5]
17    7.0 (4.25,10.5]
18    9.0 (4.25,10.5]
19   11.0 (10.5,16.8]
20   15.0 (10.5,16.8]
21   23.0   (16.8,23]
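
A side note on the bins above: `seq(min, max, length.out = 5)` produces four equal-width intervals rather than quintiles, which is why most values land in the first bin. If equally populated groups are wanted, one option is to take the breaks from `quantile()`. The sketch below is one reading of what "quintiles" could mean here, not the answer's method; `breaks_q`, `bins_q` and `z_q` are illustrative names. With heavily tied data the quantile breaks can coincide, in which case `cut()` would reject them.

```r
values <- c(2.5, 1, 0, 1.2, 2, 3, 2, 1, 0, -2, -1, .5, 2, 3, 6, 5, 7, 9, 11, 15, 23)

# Breaks at the 0%, 20%, ..., 100% sample quantiles give five bins
# with roughly equal numbers of observations.
breaks_q <- quantile(values, probs = seq(0, 1, length.out = 6))
bins_q <- cut(values, breaks_q, include.lowest = TRUE)
table(bins_q)

# Drawing 2 values from each of the 5 bins yields exactly 10 samples.
set.seed(1234)
z_q <- unlist(lapply(split(values, bins_q),
                     function(v) v[sample.int(length(v), 2)]),
              use.names = FALSE)
length(z_q)
```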

Now you can split the data by quintiles, calculate the proportion of each group, and sample.

groups <- split(df$values, df$quintiles)  # split the values, so each group is a plain vector that resample() can index
probs <- prop.table(table(df$quintiles))
nsample <- as.vector(ceiling(probs*10))

> nsample
[1] 7 2 1 1

resample <- function(x, ...) x[sample.int(length(x), ...)]
mysamples <- mapply(function(x, y) resample(x = x, size = y), groups, nsample)
z <- unname(unlist(mysamples))

> z
 [1]  2.0  1.0  0.0  1.0  3.0  0.5  3.0  5.0  9.0 11.0 23.0

Because of ceiling(), this can result in 11 cases being sampled instead of 10.
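
If exactly 10 samples are required, one possible workaround is to round the group sizes down and hand the leftover draws to the groups with the largest fractional parts (a largest-remainder allocation). This is a sketch under that assumption rather than the answer's method; `exact`, `alloc`, `leftover` and `top` are names introduced here.

```r
values <- c(2.5, 1, 0, 1.2, 2, 3, 2, 1, 0, -2, -1, .5, 2, 3, 6, 5, 7, 9, 11, 15, 23)
cutpoints <- seq(min(values), max(values), length.out = 5)
bins <- cut(values, cutpoints, include.lowest = TRUE)

n_total <- 10
exact <- as.vector(prop.table(table(bins))) * n_total  # ideal, fractional group sizes
alloc <- floor(exact)
leftover <- n_total - sum(alloc)

# Give the remaining draws to the bins with the largest fractional parts
top <- order(exact - alloc, decreasing = TRUE)[seq_len(leftover)]
alloc[top] <- alloc[top] + 1
alloc   # 7 2 1 0

groups <- split(values, bins)
resample <- function(x, ...) x[sample.int(length(x), ...)]
set.seed(1234)
z <- unlist(mapply(resample, groups, size = alloc, SIMPLIFY = FALSE),
            use.names = FALSE)
length(z)   # 10
```

The trade-off is visible in the allocation: the sparsest bin, which holds only the value 23, receives no draw, so the total is fixed at 10 at the cost of possibly skipping small groups.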
