r - Conditional filtering / subsetting of linear distance data in R

Question

Here is my small example:

Mark <- paste ("SN", 1:400, sep = "") 
highway <- rep(1:4, each = 100)
set.seed (1234)
MAF <- rnorm (400, 0.3, 0.1)
PPC <- abs (ceiling( rnorm (400, 5, 5)))

set.seed (1234)
Position  <- round(c(cumsum (rnorm (100, 5, 3)), 
cumsum (rnorm (100, 10, 3)), cumsum (rnorm (100, 8, 3)),
  cumsum (rnorm (100, 6, 3))), 1)

mydf <- data.frame (Mark, highway, Position, MAF, PPC)

I want to filter the data so that PPC is less than 10 and, at the same time, MAF is greater than 0.3.

  # filter PPC < 10 & MAF > 0.3 
 filtered <-  mydf[mydf$PPC < 10  & mydf$MAF > 0.3,]

I have a grouping variable, highway, and each mark has a position on its highway. For example, the first five marks on highway 1:

      1.4     7.2      15.5 13.4 19.7
 |-----|.......|.......|.....|.....|
      "SN1" "SN2"   "SN3"  "SN4" "SN5"

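Now I want to select roughly 30 marks based on their position on each highway (taking the different highway lengths into account), so that they are well distributed along each highway and the distance between any two selected marks is no less than 10.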

Edit: the idea (rough sketch): [figure]

I have only a rough idea of how to solve this problem. Help is appreciated.

Edit: here is what I could come up with:

# The maximum (length) of each highway is: 
out <-  tapply(mydf$Position, mydf$highway, max)
out 
     1      2      3      4 
 453.0 1012.4  846.4  597.6 

min(out)
[1] 453

 #Total length of all highways 
totallength <- sum(out)

# Thus average distance at which mark need to be placed:
totallength / 30 
[1] 96.98

For highway 1, the theoretical marks could be located at:

 96.98, 96.98 + 96.98, 96.98 + 96.98 + 96.98, ... for as long as the value
    stays below the maximum (length) of highway 1.

So in theory we need to pick a mark every 96.98 units. However, there may be no mark located at exactly those positions on the highway.
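The theoretical positions described above can be sketched with `seq()`. This is only an illustration: the highway lengths are copied from the `tapply()` output shown earlier, and the names `highway_lengths`, `spacing`, and `theoretical` are hypothetical.

```r
# Highway lengths copied from the tapply() output above (unfiltered data)
highway_lengths <- c("1" = 453.0, "2" = 1012.4, "3" = 846.4, "4" = 597.6)

# Average spacing if about 30 marks are spread over the total length
spacing <- sum(highway_lengths) / 30

# For each highway, step along in multiples of `spacing` until its end
theoretical <- lapply(highway_lengths, function(len) seq(spacing, len, by = spacing))

theoretical[["1"]]
# [1]  96.98 193.96 290.94 387.92
```

Each theoretical position would still have to be matched to the nearest mark that actually exists on that highway.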

Note: the total number of marks selected does not have to be exactly 30 (approximately 30 is fine).

score 3 · Accepted Answer

Since we don't care about any of the other columns, the code is a little easier if we use split to get a list of positions.

filtered$highway <- factor(filtered$highway)
positions <- with(filtered, split(Position, highway))

A suitable number of marks for each highway can be found from the relative length of each highway.

highway_lengths <- sapply(positions, max)
total_length <- sum(highway_lengths)
n_marks_per_highway <- round(30 * highway_lengths / total_length)
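As a worked check of the formula above, here it is applied to the unfiltered highway lengths printed in the question. The lengths after filtering may differ slightly, so the numbers are illustrative only; `example_lengths` and `example_n` are hypothetical names.

```r
# Highway lengths taken from the tapply() output in the question (unfiltered data)
example_lengths <- c("1" = 453.0, "2" = 1012.4, "3" = 846.4, "4" = 597.6)

# Share of the 30 marks given to each highway, proportional to its length
example_n <- round(30 * example_lengths / sum(example_lengths))
example_n
#  1  2  3  4 
#  5 10  9  6 

sum(example_n)
# [1] 30
```

Because of the rounding, the total is not guaranteed to be exactly 30 for other inputs, which matches the "approximately 30" requirement in the question.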

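We can use the quantile function to get target points spread evenly along each highway.

target_mark_points <- mapply(
  function(pos, n)
  {
    # n evenly spaced probabilities from 0 to 1; length.out also handles n = 1
    quantile(pos, seq(0, 1, length.out = n))
  },
  positions,
  n_marks_per_highway,
  SIMPLIFY = FALSE  # always return a list, one element per highway
)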
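One point worth keeping in mind: `quantile()` spaces the targets evenly by mark rank, not by distance, so the two only agree when the marks themselves are spread fairly uniformly. A minimal sketch on made-up positions (`toy_pos`, `by_rank`, and `by_distance` are hypothetical names) shows the difference:

```r
# Made-up positions: four marks bunched at the start, one far away
toy_pos <- c(0, 1, 2, 3, 100)
n <- 3

# Targets from quantiles: evenly spaced by rank among the marks
by_rank <- unname(quantile(toy_pos, seq(0, 1, length.out = n)))

# Targets evenly spaced by distance along the highway
by_distance <- seq(min(toy_pos), max(toy_pos), length.out = n)

by_rank      # 0 2 100
by_distance  # 0 50 100
```

For data like the question's, where marks are laid down at roughly regular intervals, the two sets of targets should be close.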

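For each target point, we find the nearest existing mark on the highway.

actual_mark_points <- mapply(
  function(pos, target)  
  {
    sapply(target, function(tgt) 
    {
      # distance from this target to every existing mark
      d <- abs(tgt - pos)
      pos[which.min(d)]
    })
  },
  positions,
  target_mark_points,
  SIMPLIFY = FALSE  # keep one vector of marks per highway
)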
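The nearest-mark step does not by itself guarantee the minimum distance of 10 asked for in the question. The following is a minimal sketch of how that could be checked and, if needed, enforced. Both helpers (`min_gap`, `enforce_min_gap`) and the toy data are hypothetical, and the greedy thinning is one simple option rather than part of the answer's method.

```r
# Hypothetical helper: smallest gap between neighbouring selected marks
min_gap <- function(x) {
  x <- sort(unique(x))
  if (length(x) < 2) return(Inf)
  min(diff(x))
}

# Hypothetical helper: greedy thinning. Walk along the sorted marks and keep
# a mark only if it lies at least `min_dist` beyond the last mark kept.
enforce_min_gap <- function(x, min_dist = 10) {
  x <- sort(unique(x))
  if (length(x) == 0) return(x)
  keep <- x[1]
  for (p in x[-1]) {
    if (p - keep[length(keep)] >= min_dist) keep <- c(keep, p)
  }
  keep
}

# Made-up selections for two highways
toy_marks <- list("1" = c(1.4, 7.2, 50.3, 98.0), "2" = c(12.0, 60.5, 110.1))

sapply(toy_marks, min_gap)   # highway 1 has a gap below 10
thinned <- lapply(toy_marks, enforce_min_gap, min_dist = 10)
thinned[["1"]]               # 1.4 50.3 98.0
```

On the real data, `sapply(actual_mark_points, min_gap)` would show whether any highway violates the constraint. Thinning reduces the number of marks, which is acceptable here since the total only needs to be about 30.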

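Just to see whether it worked, you can visualize the marks.

is_mark_point <- mapply(
  function(pos, mark)
  {
    # TRUE for positions that were chosen as mark points
    pos %in% mark
  },
  positions,
  actual_mark_points,
  SIMPLIFY = FALSE  # unsplit() below needs a list, one element per highway
)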

filtered$is.mark.point <- unsplit(is_mark_point, filtered$highway)

library(ggplot2)    
(p <- ggplot(filtered, aes(Position, highway, colour = is.mark.point)) +
  geom_point()
)
