r - Efficient subsetting in R using 2 data frames

Question

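I have a large time series, full, in one data frame and a list of timestamps, test, in a different data frame. I need to subset full to the data points surrounding the timestamps in test. My first instinct (as an R noob) was to write the following, which is wrong: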

subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))

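Looking at the results, I realized that R walks through both vectors in parallel, which gives the wrong result. My alternative is to write a loop like the following: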

subs<-data.frame()
for (j in test$dt) 
  subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))
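
The parallel traversal described above is R's vector recycling, and it can be reproduced on a toy example. This is a minimal sketch with made-up vectors (x and centres are illustrative names, not part of the question): the element-wise comparison pairs each element of the longer vector with a recycled element of the shorter one, whereas outer() compares every value against every centre.

```r
# Toy data (hypothetical, chosen only to make the recycling visible)
x <- 1:6
centres <- c(2, 5)

# Element-wise comparison: centres is recycled to c(2, 5, 2, 5, 2, 5),
# so x[1] is tested against 2, x[2] against 5, x[3] against 2, and so on.
recycled <- x > centres - 1 & x < centres + 1

# outer() builds a length(x) by length(centres) matrix of all pairings,
# which is the comparison that was actually intended.
all_pairs <- outer(x, centres, function(a, b) a > b - 1 & a < b + 1)
hit <- rowSums(all_pairs) > 0

print(recycled)
print(hit)
```

Note that outer() materialises the full matrix, so it illustrates the idea but would be memory-hungry for a very large full.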

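I feel there may be a better way than a loop, and this article implores us to avoid R loops wherever possible. The other reason is that I may run into performance problems, since this will sit at the core of an optimization algorithm. Any suggestions from the gurus would be greatly appreciated.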

EDIT:

Here is some reproducible code that shows the wrong approach as well as the approach that works but could probably be improved.

#create a time series
full <- data.frame(seq(1:200),rnorm(200,0,1))
colnames(full)<-c("dt","val")

#my smaller array of points of interest
test <- data.frame(seq(5,200,by=23))
colnames(test)<-c("dt")

# my range around the points of interest
i<-3 

#the wrong approach
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))

#this works, but not sure this is the best way to go about it
subs<-data.frame()
for (j in test$dt) 
  subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))

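EDIT: I updated the values to better reflect my use case, and I see that @mrdwab's solution unexpectedly comes out ahead by a wide margin.

I am using the benchmark code from @mrdwab, with the initialization as follows: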

set.seed(1)

full <- data.frame(
  dt  = 1:15000000,
  val = floor(rnorm(15000000,0,1))
)


test <- data.frame(dt = floor(runif(24,1,15000000)))

i <- 500

The benchmarks are:

       test replications elapsed relative
2    mrdwab            2    1.31  1.00000
3 spacedman            2   69.06 52.71756
1    andrie            2   93.68 71.51145
4  original            2  114.24 87.20611

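Totally unexpected. Mind = blown. Can someone shed some light on this dark corner and explain what is going on?

IMPORTANT: As @mrdwab points out below, his solution only works if the vectors are integers. If they are not, @spacedman has the right solution.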
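
To make the integer restriction concrete, here is a small sketch with made-up fractional timestamps (dt, centre and the result names are illustrative, not from the question). Enumerating the integers in the window and matching with %in% misses any value that is not a whole number, while the strict interval comparison does not.

```r
# Hypothetical non-integer timestamps
dt <- c(4.5, 5.0, 5.5, 7.9, 8.0)
centre <- 5
i <- 3

# Enumeration approach: only the integers 3, 4, 5, 6, 7 can ever match
by_enum <- dt %in% ((centre - (i - 1)):(centre + (i - 1)))

# Interval approach: everything strictly between 2 and 8 matches
by_interval <- dt > centre - i & dt < centre + i

print(by_enum)
print(by_interval)
```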

score 6 · Accepted Answer

Here is a real R way to do it. Functionally. No loops...

Starting with Andrie's example data.

First, an interval comparison function:

> cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}

Then an OR composition function:

> OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}

Now there is a loop of sorts here, which builds a list of those comparison functions:

> funs = mapply(cf,test$dt-i,test$dt+i)

Now combine all of them into a single function:

> anyF = Reduce(OR,funs)

Now apply the combined function to full$dt and use the result to subset full:

> head(full[anyF(full$dt),])
   dt         val
3   3 -0.83562861
4   4  1.59528080
5   5  0.32950777
6   6 -0.82046838
7   7  0.48742905
26 26 -0.05612874

What you have now is a function of a single variable that tests whether a value falls within any of the ranges you defined.

> anyF(1:10)
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
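
As a rough illustration of what Reduce does with these closures, the sketch below folds two interval tests by hand and compares the result with Reduce(OR, ...). The bounds and the names f_a, f_b, by_reduce and by_hand are made up for the example; cf and OR are the definitions from above, without the console prompt.

```r
# Closure factories as defined in the answer
cf <- function(l, u) { force(l); force(u); function(x) x > l & x < u }
OR <- function(f1, f2) { force(f1); force(f2); function(x) f1(x) | f2(x) }

# Two hypothetical windows: (2, 8) and (25, 31)
f_a <- cf(2, 8)
f_b <- cf(25, 31)

# Reduce folds the list left to right: OR(f_a, f_b)
by_reduce <- Reduce(OR, list(f_a, f_b))

# The same test written out by hand
by_hand <- function(x) f_a(x) | f_b(x)

x <- 1:30
print(identical(by_reduce(x), by_hand(x)))
```

With more windows the fold simply nests further, so calling the combined function evaluates every interval test once over the whole input vector.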

I don't know if this is faster, or better, or what. Somebody do some benchmarks!

score 4 · Accepted Answer

I don't know if it is any more efficient, but I think you could also do something like this to get what you want:

subs <- apply(test, 1, function(x) c((x-2):(x+2)))
full[which(full$dt %in% subs), ]

I had to change your "3" to "2" because the range (x-2):(x+2) includes both of its endpoints, whereas your comparison is strict.
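
The adjustment from 3 to 2 can be checked directly. The following is a small sketch on made-up integer data (dt, j and the result names are illustrative): for whole numbers, the strict comparison with half-width i selects the same elements as matching against the inclusive range with half-width i - 1.

```r
# Hypothetical integer timestamps and a single point of interest
dt <- 1:20
j <- 5L
i <- 3L

# Strict comparison, as in the question
strict <- dt > (j - i) & dt < (j + i)

# Inclusive enumeration with the half-width reduced by one
enumerated <- dt %in% ((j - (i - 1L)):(j + (i - 1L)))

print(which(strict))
print(identical(strict, enumerated))
```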

Benchmarking (just for fun)

@Spacedman leads the way!

First, the required data and functions.

## Data
set.seed(1)

full <- data.frame(
  dt  = 1:200,
  val = rnorm(200,0,1)
)

test <- data.frame(dt = seq(5,200,by=23))

i <- 3 

## Spacedman's functions
cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
funs = mapply(cf,test$dt-i,test$dt+i)
anyF = Reduce(OR,funs)

Second, the benchmarking.

## Benchmarking
require(rbenchmark)
benchmark(andrie = do.call(rbind, 
                           lapply(test$dt, 
                                  function(j) full[full$dt > (j-i) & 
                                    full$dt < (j+i), ])),
          mrdwab = {subs <- apply(test, 1, 
                                  function(x) c((x-(i-1)):(x+(i-1))))
                    full[which(full$dt %in% subs), ]},
          spacedman = full[anyF(full$dt),],
          original = {subs <- data.frame()
                      for (j in test$dt) 
                        subs <- rbind(subs, 
                                      subset(full, full$dt > (j-i) & 
                                        full$dt < (j+i)))},
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative")
#        test replications elapsed  relative
# 3 spacedman          100   0.064  1.000000
# 2    mrdwab          100   0.105  1.640625
# 1    andrie          100   0.520  8.125000
# 4  original          100   1.080 16.875000
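
Before comparing timings, it may be worth confirming that the candidates select the same rows. The sketch below rebuilds the small example data and checks three of the approaches against each other; the res_* names are introduced here for illustration. It relies on the windows in this sample not overlapping and on dt being integer.

```r
set.seed(1)
full <- data.frame(dt = 1:200, val = rnorm(200, 0, 1))
test <- data.frame(dt = seq(5, 200, by = 23))
i <- 3

# andrie: one subset per timestamp, bound together at the end
res_andrie <- do.call(rbind,
  lapply(test$dt, function(j) full[full$dt > (j - i) & full$dt < (j + i), ]))

# mrdwab: enumerate the integers in each window and match with %in%
subs <- apply(test, 1, function(x) (x - (i - 1)):(x + (i - 1)))
res_mrdwab <- full[which(full$dt %in% subs), ]

# spacedman: OR together one closure per window
cf <- function(l, u) { force(l); force(u); function(x) x > l & x < u }
OR <- function(f1, f2) { force(f1); force(f2); function(x) f1(x) | f2(x) }
anyF <- Reduce(OR, mapply(cf, test$dt - i, test$dt + i))
res_spacedman <- full[anyF(full$dt), ]

# All three should pick the same dt values in the same order
print(identical(res_andrie$dt, res_mrdwab$dt))
print(identical(res_mrdwab$dt, res_spacedman$dt))
```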

score 4 · Accepted Answer

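There is nothing inherently wrong with your code. To achieve your goal, you need some kind of loop around a vectorized subsetting operation.

But here is a more R-ish way of doing it, which may well be faster: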

do.call(rbind, 
  lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)

PS: You can significantly simplify your reproducible example:

set.seed(1)

full <- data.frame(
  dt  = 1:200,
  val = rnorm(200,0,1)
)

test <- data.frame(dt = seq(5,200,by=23))

i <- 3 

xx <- do.call(rbind, 
  lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)

head(xx)
   dt         val
3   3 -0.83562861
4   4  1.59528080
5   5  0.32950777
6   6 -0.82046838
7   7  0.48742905
26 26 -0.05612874
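
One behaviour of the stacking approach to keep in mind: if two windows overlap, the shared rows appear once per window. The sketch below uses made-up data with deliberately overlapping centres (centres, stacked and deduped are illustrative names) and removes the repeats with duplicated(); whether duplicates are wanted depends on the use case.

```r
# Hypothetical data with two overlapping windows: (2, 8) and (4, 10)
full <- data.frame(dt = 1:20, val = (1:20) / 10)
centres <- c(5, 7)
i <- 3

# One subset per centre, stacked with rbind
stacked <- do.call(rbind,
  lapply(centres, function(j) full[full$dt > (j - i) & full$dt < (j + i), ]))

# Rows with dt 5, 6 and 7 fall in both windows and so appear twice
print(stacked$dt)

# Keep only the first occurrence of each timestamp
deduped <- stacked[!duplicated(stacked$dt), ]
print(deduped$dt)
```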

score 0 · Accepted Answer

Another approach, using data.table:

library(data.table)  # needed for data.table(), J(), setkey() and :=
{
temp <- data.table(x=unique(c(full$dt,(test$dt-i),(test$dt+i))),key="x")
temp[,index:=1:nrow(temp)]
startpoints <- temp[J(test$dt-i),index]$index
endpoints <- temp[J(test$dt+i),index]$index
allpoints <- as.vector(mapply(FUN=function(x,y) x:y,x=startpoints,y=endpoints))
setkey(x=temp,index)
ans <- temp[J(allpoints)]$x
}

Benchmarks: rows in test: 9, rows in full: 10000

       test replications elapsed relative
1 spacedman          100   0.406    1.000
2       new          100   1.179    2.904

Rows in full: 100000

       test replications elapsed relative
2       new          100   2.374    1.000
1 spacedman          100   3.753    1.581
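
Since the timings above shift as full grows, a base R variant that also handles non-integer timestamps may be worth benchmarking too. The sketch below is not taken from any of the answers: in_any_window is a hypothetical helper that uses findInterval() to count, for each value, how many windows have started before it and how many have already ended. It assumes a single half-width shared by all windows and an R version whose findInterval() supports left.open.

```r
# Hypothetical helper: TRUE where x lies strictly inside at least one
# window (centre - half_width, centre + half_width)
in_any_window <- function(x, centres, half_width) {
  lower <- sort(centres - half_width)
  upper <- sort(centres + half_width)
  # Number of windows whose lower bound is strictly below x
  n_started <- findInterval(x, lower, left.open = TRUE)
  # Number of windows whose upper bound is at or below x
  n_ended <- findInterval(x, upper)
  # x is inside some window when more have started than have ended
  n_started > n_ended
}

# Made-up values around the windows (2, 8) and (25, 31)
x <- c(4.5, 5, 7.9, 8, 2, 2.1, 30)
inside <- in_any_window(x, centres = c(5, 28), half_width = 3)
print(inside)
```

Usage on the question's data would look like full[in_any_window(full$dt, test$dt, i), ], which returns each matching row once even when windows overlap.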
