I am backtesting some investment strategies in R, and I have the following script:

set.seed(1)
output.df <- data.frame(action=sample(c("initial_buy","sell","buy"),
          10000,replace=TRUE),stringsAsFactors=FALSE)
output.df[,"uid"] <- 1:nrow(output.df)

cutrow.fx <- function(output.df) {
  loop.del <- 2
  while (loop.del <= nrow(output.df)) {
    if ((output.df[loop.del,"action"]=="initial_buy" & 
            output.df[loop.del-1,"action"]=="initial_buy")|
          (output.df[loop.del,"action"]=="sell" & 
            output.df[loop.del-1,"action"]=="sell")|
          (output.df[loop.del,"action"]=="buy" & 
            output.df[loop.del-1,"action"]=="sell")|
          (output.df[loop.del,"action"]=="initial_buy" & 
            output.df[loop.del-1,"action"]=="buy")){
      output.df <- output.df[-loop.del,]
    } else {
      loop.del <- loop.del + 1
    }
  }
output.df<<-output.df
}

print(system.time(cutrow.fx(output.df=output.df)))

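The strategy decides: 1) when to start buying a stock; 2) when to add further contributions to the stock; 3) when to sell all of the stock. I have a data frame containing stock prices for the past 10 years. I wrote 3 scripts that indicate on which dates I should buy or sell, then combined the 3 results and ordered them with order.

I need to remove some "impossible actions". For example, I cannot sell the same stock twice without buying new units in between, so I used the script above to delete those impossible actions. But the loop is a bit slow.

Any suggestions for speeding it up?

Update 01

I have updated cutrow.fx to the following, but it fails: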

cutrow.fx <- function(output.df) {
  output.df[,"action_pre"] <- "NIL"
  output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]                    
  while (any(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy")|
           any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell")|
           any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy")|
           any(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy")) {
    output.df <- output.df[!(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy"),]
    output.df[,"action_pre"] <- "NIL"
    output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]                    
  }        
  output.df[,"action_pre"] <- NULL
  output.df<<-output.df
}

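I used the vector comparison as inspiration (loosely, because I am not sure I understood exactly what he meant in his answer) and repeated it with a while loop. But the output is not the same.

Is a loop unavoidable here?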

3 Answers

It looks like all you're doing is checking the last action. This doesn't require a loop at all. All you have to do is shift the vector and do straight vector comparisons. Here's an artificial example.

x <- sample(1:11)
buysell <- sample(c('buy', 'sell'), 11, replace = TRUE)

So, I have 11 samples, x, and whether I've bought or sold them. I want to make a boolean that shows whether I bought or sold the last sample.

bought <- c(NA, buysell[1:10])
which( bought == 'buy' )

Examine the x and buysell variables and you'll see the results here are the index of the x items where a buy was made on the prior item.

Also, you might want to check out the function %in%.
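
As a small illustration of what %in% buys here, the sketch below checks that the four pairwise tests from the question collapse into two tests. The vectors prev and cur are made-up sample values, not the poster's data.

```r
# Made-up sample pairs of (previous action, current action)
prev <- c("initial_buy", "buy", "sell", "sell")
cur  <- c("initial_buy", "initial_buy", "buy", "initial_buy")

# The four "impossible action" tests, written out as in the question
four_tests <- (prev == "initial_buy" & cur == "initial_buy") |
  (prev == "sell" & cur == "sell") |
  (prev == "sell" & cur == "buy") |
  (prev == "buy" & cur == "initial_buy")

# The same rule with %in%: two tests instead of four
two_tests <- (prev %in% c("initial_buy", "buy") & cur == "initial_buy") |
  (prev == "sell" & cur %in% c("sell", "buy"))

print(identical(four_tests, two_tests))
```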

answered 2013-01-04T17:03:32.187

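I tried to do something clever with vectorization, but failed, because earlier iterations of the loop can change the data relationships seen by later iterations. As a result, I could not simply lag the data by a fixed amount and compare it with the actual values.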
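
A minimal sketch of why a fixed lag is not enough, using a made-up four-row sequence (not the poster's data). The helper is_bad is a hypothetical name that encodes the same four rules as cutrow.fx.

```r
# Made-up four-row example
actions <- c("initial_buy", "sell", "buy", "initial_buy")

# TRUE when `cur` may not follow `prev` (same four rules as cutrow.fx)
is_bad <- function(prev, cur) {
  (prev == "initial_buy" & cur == "initial_buy") |
    (prev == "sell" & cur == "sell") |
    (prev == "sell" & cur == "buy") |
    (prev == "buy" & cur == "initial_buy")
}

# Fixed lag: every row is compared with the ORIGINAL previous row
lagged_prev <- head(actions, -1)
bad_lagged <- c(FALSE, is_bad(lagged_prev, actions[-1]))
one_pass <- actions[!bad_lagged]

# Sequential filter: every row is compared with the last row that was KEPT
kept <- actions[1]
for (a in actions[-1]) {
  if (!is_bad(kept[length(kept)], a)) kept <- c(kept, a)
}

print(one_pass)
print(kept)
```

Row 3 (buy after sell) is dropped in both versions. The lagged version then also drops row 4, because it still sees the deleted buy as the predecessor, whereas the sequential version compares row 4 with the sell in row 2 and keeps it.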

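What I could do was minimize the copying involved. R assigns by copy, so a statement like output.df <- output.df[-loop.del,] copies the entire data structure for every row that is deleted. Instead of changing (and copying) the data frame, I change a logical vector. Other attempts to speed things up include using the short-circuiting scalar AND (&&) instead of the element-wise AND (&), using %in% to make fewer comparisons, and minimizing accesses to output.df.

To compare the two functions, I slightly modified the OP's solution so that the original data frame is not overwritten. It looks like this improves the speed by a factor of 10 or more, but it still takes a fair amount of time (>0.5 sec). I would love to see any faster solutions.

OP's solution (return value slightly modified, no global assignment)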
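
A short sketch of the two ideas in this paragraph, with made-up values. The helper never_called and the small data frame df are hypothetical and exist only for illustration.

```r
# Hypothetical helper that must never run; it only demonstrates short-circuiting
never_called <- function() stop("right-hand side was evaluated")

# && stops at the first FALSE, so never_called() is not evaluated
short_circuit <- FALSE && never_called()

# & is element-wise and returns one result per element
prev <- c("sell", "buy", "sell")
cur  <- c("sell", "sell", "buy")
elementwise <- prev == "sell" & cur == "sell"

# Mark rows to keep in a logical vector, then subset the data frame once
df <- data.frame(action = c("initial_buy", "initial_buy", "sell"),
                 stringsAsFactors = FALSE)
keep <- c(TRUE, FALSE, TRUE)
result <- df[keep, , drop = FALSE]

print(short_circuit)
print(elementwise)
print(result)
```

Subsetting once at the end means the data frame is copied a single time, rather than once per deleted row.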

cutrow.fx <- function(output.df) {
  loop.del <- 2
  while (loop.del <= nrow(output.df)) {
    if ((output.df[loop.del,"action"]=="initial_buy" & 
            output.df[loop.del-1,"action"]=="initial_buy")|
          (output.df[loop.del,"action"]=="sell" & 
            output.df[loop.del-1,"action"]=="sell")|
          (output.df[loop.del,"action"]=="buy" & 
            output.df[loop.del-1,"action"]=="sell")|
          (output.df[loop.del,"action"]=="initial_buy" & 
            output.df[loop.del-1,"action"]=="buy")){
      output.df <- output.df[-loop.del,]
    } else {
      loop.del <- loop.del + 1
    }
  }
return(output.df)
}
ans1 <- cutrow.fx(output.df)

My solution

cutrow.fx2 <- function(output.df) {
    ##edge case if output.df has too few rows
    if (nrow(output.df) < 2) return(output.df)
    ##logical vector of indices of rows to keep
    idx <- c(TRUE,logical(nrow(output.df)-1))
    ##keeps track of the previous row
    prev.row <- 1
    prev.act <- output.df[prev.row,"action"]
    for (current.row in seq_len(nrow(output.df))[-1]) {
        ##access output.df only once per iteration
        current.act <- output.df[current.row,"action"]
        ##checks to see if current row is bad
        ##if so, continue to next row and leave previous row as is
        if ( (prev.act %in% c("initial_buy","buy")) && 
             (current.act == "initial_buy") ) {
            next
        } else if ( (prev.act == "sell") &&
            (current.act %in% c("buy","sell")) ) {
            next
        }
        ##if current row is good, mark it in idx and update previous row
        idx[current.row] <- TRUE
        prev.row <- current.row
        prev.act <- current.act
    }
    return(output.df[idx,])
}
ans2 <- cutrow.fx2(output.df)

Check that the answers are the same

identical(ans1,ans2)
## [1] TRUE

#benchmarking
require(microbenchmark)
mb <- microbenchmark(
  ans1=cutrow.fx(output.df)
  ,ans2=cutrow.fx2(output.df),times=50)
print(mb)
# Unit: milliseconds
#   expr       min        lq    median         uq        max
# 1 ans1 9630.1671 9743.1102 9967.6442 10264.7000 12396.5822
# 2 ans2  481.8821  491.6699  500.6126   544.4222   645.9658

plot(mb)
require(ggplot2)
ggplot2::qplot(y=time, data=mb, colour=expr) + ggplot2::scale_y_log10()
answered 2013-01-05T20:21:16.940

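Here is some simpler and faster code. It does not iterate over every element; it only loops from one match to the next. It matches forward rather than backward.

First, modify your cutrow.fx function: remove the <<-output.df on the last line and simply return the result. Then you can run both functions and compare the results.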

cutrow.fx1 <- function(d) {
  len <- length(d[,1])
  o <- logical(len)
  f <- function(a) {
    switch(a,
           initial_buy=c('buy', 'sell'), 
           buy=c('buy', 'sell'),
           sell='initial_buy'
           )
  }
  cur <- 1
  o[cur] <- TRUE
  while (cur < len) {
    nxt <- match(f(d[cur,1]), d[(cur+1):len,1])
    if (all(is.na(nxt))) {
      break
    } else {
      cur <- cur + min(nxt, na.rm=TRUE)
      o[cur] <- TRUE
    }
  }
  d[o,]
}

Proof that the results are correct:

identical(cutrow.fx1(output.df), cutrow.fx(output.df))
## [1] TRUE

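And it is much faster. This is due to a partial vectorization of the problem: match is used to find the next row to keep, rather than iterating to discard rows.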
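
To make the match step concrete, here is a small sketch with made-up vectors (the names remaining and allowed are illustrative, not taken from the function above). match returns the position of the first occurrence of each allowed action, and the minimum of those positions is the offset to the next row to keep.

```r
# Made-up actions that follow the current row
remaining <- c("initial_buy", "sell", "buy", "sell")

# Actions allowed after a buy or an initial_buy
allowed <- c("buy", "sell")

# Position of the first "buy" and of the first "sell" in `remaining`
first_pos <- match(allowed, remaining)

# The nearest allowed action gives the offset to the next kept row
offset <- min(first_pos, na.rm = TRUE)

# When no allowed action remains, match returns NA and the loop stops
no_match <- match("initial_buy", c("buy", "sell"))

print(first_pos)
print(offset)
print(no_match)
```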

print(system.time(cutrow.fx(output.df)))
##   user  system elapsed 
##  5.688   0.000   5.720 

print(system.time(cutrow.fx1(output.df)))
##   user  system elapsed 
##  1.050   0.000   1.056 
answered 2013-01-05T23:59:59.297