r - Can these two functions be optimized (vectorized) for better performance?

Question

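In my first attempt at using R, I wrote two functions that I suspect do not perform very well. I would appreciate any tips on how to make them faster (vectorized). Both functions are followed by a test case.

The first function takes two xts time series objects, x and y, and returns a series containing the number of consecutive days x has been above/below y.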

require('xts')
require('quantmod')

countDaysBelowOrAbove <- function(x, y) {
    x <- try.xts(x, error=as.matrix)
    y <- try.xts(y, error=as.matrix)

    if(is.xts(x) && is.xts(y)) {
        xy <- cbind(x,y)
    } else {
        xy <- cbind( as.vector(x), as.vector(y) )
    }

    # Count NAs, ensure they're only at beginning of data, then remove.
    xNAs <- sum( is.na(x) )
    yNAs <- sum( is.na(y) )
    NAs <- max( xNAs, yNAs )
    if( NAs > 0 ) {
        if( any( is.na(xy[-(1:NAs),]) ) ) stop("Series contain non-leading NAs")
    }

    resultDaysLower <- x
    resultDaysHigher <- x
    resultDaysLower[!is.na(resultDaysLower)]<-0
    resultDaysHigher[!is.na(resultDaysHigher)]<-0

    series<-cbind(xy, resultDaysLower, resultDaysHigher)
    colnames(series) <- c(names(xy), "cumDaysLower", "cumDaysHigher")

    daysLower = 0
    daysHigher = 0

    for (i in 1:NROW(xy)) {
        if (!(is.na(series[,1][i]) | is.na(series[,2][i]))) {
            if (series[,1][i] >= series[,2][i]) {
                daysLower = 0
                daysHigher = daysHigher + 1
            }
            else {
                daysHigher = 0
                daysLower = daysLower + 1
            }
        }
        else {
            daysLower = 0
            daysHigher = 0
        }
        series$cumDaysLower[i] = daysLower
        series$cumDaysHigher[i] = daysHigher                
    }
    return(series)
}

getSymbols("SPY", from='2005-01-01')
SPYclose = Cl(SPY)

getSymbols("QQQQ", from='2005-01-01')
QQQQclose = Cl(QQQQ)

testData = countDaysBelowOrAbove(SPYclose, QQQQclose)

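The second function I would like help optimizing is below. It takes an xts series and an xts object of interval lengths, and for each point in time computes the minimum of the series over a window whose length is given by the corresponding entry in lengths. It returns the series of those minimums.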

minimumWithVaryingLength<-function(series, lengths) {
    series <- try.xts(series, error=as.matrix)
    lengths <- try.xts(lengths, error=as.matrix)

    if(is.xts(series) && is.xts(lengths)) {
        serieslengths <- cbind(series,lengths)
    } else {
        serieslengths <- cbind( as.vector(series), as.vector(lengths) )
    }

    # Count NAs, ensure they're only at beginning of data, then remove.
    seriesNAs <- sum( is.na(series) )
    lengthsNAs <- sum( is.na(lengths) )
    NAs <- max( seriesNAs, lengthsNAs )
    if( NAs > 0 ) {
        if( any( is.na(serieslengths[-(1:NAs),]) ) ) stop("Series contain non-leading NAs")
    }

    result <- series
    result[!is.na(result)]<-0

    for (i in 1:NROW(serieslengths)) {  
        if (lengths[i] > 0) {
            result[i] <- runMin(series, n=lengths[i], cumulative=FALSE)[i]
        }
        else {
            result[i] <- 0
        }
    }

    return(result)
}

getSymbols("SPY", from='2005-01-01')
SPYclose = Cl(SPY)

getSymbols("QQQQ", from='2005-01-01')
QQQQclose = Cl(QQQQ)

numDaysBelow = countDaysBelowOrAbove(SPYclose, QQQQclose)
test = minimumWithVaryingLength(SPYclose, numDaysBelow)

Thanks in advance for your help.

Kind regards, Samo.

score 4 · Accepted Answer

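For the first function, you are looking for the cumulative number of consecutive periods in which the series x is below/above y. For that you can use a small helper, CumCount(), built from cummax. First, some sample data: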

set.seed(1)
x <- sample(1:5,20,T)
y <- sample(1:5,20,T)

CumCount <- function(x) {
  z <- cumsum(x)
  z - cummax(z*(!x))
}

CumLow = CumCount(x<y)
CumHigh = CumCount(x>y)
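
A short worked example may make the trick clearer. The sketch below repeats the CumCount definition from above and applies it to a hand-made logical vector (the names flags, lastReset and runLength are only for illustration). cumsum counts every TRUE seen so far, cummax(z * (!x)) remembers that count as of the most recent FALSE, and the difference is the length of the current run of TRUE values.

```r
CumCount <- function(x) {
  z <- cumsum(x)
  z - cummax(z * (!x))
}

# Illustrative input: TRUE marks a day on which the condition holds
flags <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)

z <- cumsum(flags)                 # 1 2 2 3 4 5 5 5 6
lastReset <- cummax(z * (!flags))  # 0 0 2 2 2 2 5 5 5
runLength <- CumCount(flags)       # 1 2 0 1 2 3 0 0 1

print(rbind(z, lastReset, runLength))
```

The counter drops back to 0 on every FALSE and restarts at 1 on the next TRUE, which is the behaviour of the loop in the question, without iterating in R code.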

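For your second calculation, you are trying to find the cumulative minimum of x within each phase where x < y. The rle function ("run-length encoding") is very useful for this.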

# runs equals the length of each phase (x < y or x >= y)
runs <- rle(CumLow > 0)$lengths
# starts is the number of periods prior to each phase...
starts <- c(0,cumsum(runs)[-length(runs)]) 
#... which we use to build "blocks", a list of indices of each phase.
# (SIMPLIFY = FALSE keeps blocks a list even when all runs have equal length)
blocks <- mapply( function(x,y) x+y, starts, lapply(runs,seq_len), SIMPLIFY = FALSE)
# now apply the cummin function within each block:
# (remember to mask it by CumLow > 0 -- 
#   we only want to do this within the x<y phase)
BlockCumMin <- unlist(sapply(blocks, function(blk) cummin(x[blk]))) * (CumLow > 0)
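
As a possible alternative to building the index blocks by hand, the same per-phase cumulative minimum can be sketched with ave(), which applies a function within groups and returns the result in the original order. This is not the answer's method, only a variation under the same idea; the inputs are small hand-made vectors, and the names inPhase, runId and BlockCumMin2 are illustrative.

```r
# Small hand-made inputs (not the sampled data used above)
x <- c(3, 4, 2, 2, 4, 2, 4, 1, 2, 1)
y <- c(4, 2, 2, 5, 4, 2, 1, 3, 5, 3)

inPhase <- x < y                 # plays the role of CumLow > 0
r <- rle(inPhase)
# Give every run (phase) its own id: 1 2 2 3 4 4 4 5 5 5
runId <- rep(seq_along(r$lengths), r$lengths)

# cummin restarts at each new run; multiplying by inPhase zeroes
# the runs where x >= y
BlockCumMin2 <- ave(x, runId, FUN = cummin) * inPhase

print(cbind(x, y, runId, BlockCumMin2))
```

Because ave() handles the splitting and reassembly, there is no intermediate list of indices to construct, at the cost of hiding the block structure that the rle-based version makes explicit.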

Now we put it all together:

  > cbind(x,y, CumLow, CumHigh, BlockCumMin)

      x y CumLow CumHigh BlockCumMin
 [1,] 3 4      1       0           3
 [2,] 4 2      0       1           0
 [3,] 2 2      0       0           0
 [4,] 2 5      1       0           2
 [5,] 4 4      0       0           0
 [6,] 2 2      0       0           0
 [7,] 4 1      0       1           0
 [8,] 1 3      1       0           1
 [9,] 2 5      2       0           1
[10,] 1 3      3       0           1
[11,] 2 5      4       0           1
[12,] 1 4      5       0           1
[13,] 4 2      0       1           0
[14,] 5 3      0       2           0
[15,] 4 1      0       3           0
[16,] 4 1      0       4           0
[17,] 3 4      1       0           3
[18,] 3 1      0       1           0
[19,] 5 3      0       2           0
[20,] 4 4      0       0           0

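Note that this question is related to another question.

Update. For the more general case, where you have a series vector and a lengths vector (of the same length as series), and you want to produce a vector called BlockMins where BlockMins[i] is the minimum of the block of lengths[i] elements of series ending at position i, you can do the following. Since the lengths are arbitrary, this is no longer a cumulative minimum; for each i you have to take the minimum over the lengths[i] elements of series ending at position i: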

set.seed(1)
series <- sample(1:5,20,T)
lengths <- sample(3:5,20,T)
BlockMins <- sapply(seq_along(lengths), 
                    function(i) min( series[ i : max(1, (i - lengths[i]+1)) ]) )
> cbind(series, lengths, BlockMins)
      series lengths BlockMins
 [1,]      1       5         1
 [2,]      1       4         1
 [3,]      3       3         1
 [4,]      4       4         1
 [5,]      5       3         3
 [6,]      1       4         1
 [7,]      1       5         1
 [8,]      4       3         1
 [9,]      2       5         1
[10,]      2       4         1
[11,]      1       5         1
[12,]      2       5         1
[13,]      2       3         1
[14,]      2       4         1
[15,]      4       5         1
[16,]      3       5         2
[17,]      5       3         3
[18,]      1       4         1
[19,]      5       3         1
[20,]      3       3         1
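
The same idea can be wrapped in a small function, sketched here under two assumptions that are not stated in the answer: a non-positive (or NA) length yields 0, mirroring the else branch of the loop in the question, and a window that would start before the first element is truncated at position 1, as in the max(1, ...) expression above. The function name blockMin is hypothetical.

```r
blockMin <- function(series, lengths) {
  stopifnot(length(series) == length(lengths))
  vapply(seq_along(series), function(i) {
    n <- lengths[i]
    # Assumption: no window requested -> 0, as in the question's loop
    if (is.na(n) || n <= 0) return(0)
    start <- max(1, i - n + 1)   # truncate windows at the first element
    min(series[start:i])
  }, numeric(1))
}

# Hand-made example data
series  <- c(5, 3, 4, 1, 2, 6)
lengths <- c(1, 2, 0, 3, 2, 4)
result  <- blockMin(series, lengths)
print(cbind(series, lengths, result))
```

For xts inputs, one would presumably pass plain vectors (for example via coredata()) and reattach the index afterwards. vapply is used instead of sapply only so that the return type is fixed; the work per element is the same.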

score 1 · Accepted Answer

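Setting the time-series machinery aside, if you have two vectors x and y and want to "return a series containing data on the number of days x is above/below y", just compare them: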

# Make up some data
x <- seq(100)
y <- x[sample(x)]
# Compare
x.greater <- sum(x>y)
x.lesser <- sum(x<y)

The key is that when you sum a logical vector, such as (x>y), R coerces TRUE to 1 and FALSE to 0.
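
A tiny illustration of that coercion, using hand-made vectors rather than the sampled data above (the variable names are only for illustration). sum() gives one total for the whole vector, while cumsum() on the same logical vector gives the running total up to each position:

```r
x <- c(1, 5, 3, 7)
y <- c(2, 4, 3, 6)

above <- x > y                 # FALSE TRUE FALSE TRUE
totalAbove <- sum(above)       # TRUE counts as 1, FALSE as 0
totalBelow <- sum(x < y)
runningAbove <- cumsum(above)  # 0 1 1 2

print(c(totalAbove = totalAbove, totalBelow = totalBelow))
print(runningAbove)
```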
