I have a large data frame with millions of rows. It is time-series data. For example:

dates <- c(1,2,3)
purchase_price <- c(5,2,1)
income <- c(2,2,2)
df <- data.frame(dates=dates,price=purchase_price,income=income)

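I want to create a new column that tells me how much I spent each day, following a rule such as "if I have enough money, buy it; otherwise, save the money."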

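I am currently looping over every row of the data frame and keeping track of the running balance. However, this takes a very long time on a large data set. As far as I can tell, I can't use vectorized operations because I have to keep track of this running variable.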

Inside the for loop I am doing:

balance = balance + row$income
buy_amt = min(balance,row$price)
balance = balance - buy_amt

Is there a faster solution?

Thanks!

2 Answers

As Paul pointed out, some iteration is necessary: each row depends on the state left by the previous one.

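However, the dependency only matters when a purchase is made (read: you only need to recalculate the balance when...). You can therefore iterate in "batches".

Try the following, which does just that: it identifies the next row with enough balance to make a purchase, processes all the preceding rows in a single call, and then continues from that point.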

library(data.table)
DT <- as.data.table(df)

## Initial Balance
b.init <- 2

setattr(DT, "Starting Balance", b.init)

## Raw balance for the day, regardless of purchase
DT[, balance := b.init + cumsum(income)]
DT[, buying  := FALSE]

## Set N, to not have to call nrow(DT) several times
N   <- nrow(DT)

## Initialize
ind <- seq_len(N)

# Identify where the next purchase is
while(length(buys <- DT[ind, ind[which(price <= balance)]]) && min(ind) < N) {
  next.buy <- buys[[1L]] # only grab the first one
  if (next.buy > ind[[1L]]) {
    not.buys <- ind[1L]:(next.buy-1L)
    DT[not.buys, buying := FALSE]
  }
  DT[next.buy, `:=`(buying  = TRUE
                  , balance = (balance - price)
                  ) ]

  # If rows remain after 'next.buy', recalculate their balance;
  # otherwise stop, since (N + 1):N would count backwards in R
  if (next.buy >= N) break
  ind <- (next.buy + 1L):N
  DT[ind, balance := cumsum(income) + DT[["balance"]][[ind[[1L]] - 1L]]]
}
# Final row needs to be outside of while-loop, or else will buy that same item multiple times
if (DT[N, !buying && (balance >= price)])
  DT[N, `:=`(buying  = TRUE, balance = (balance - price)) ]

Results:

## Show output
{
  print(DT)
  cat("Starting Balance was", attr(DT, "Starting Balance"), "\n")
}


## Starting with 3: 
   dates price income balance buying
1:     1     5      2       0   TRUE
2:     2     2      2       0   TRUE
3:     3     3      2       2  FALSE
4:     4     5      2       4  FALSE
5:     5     2      2       4   TRUE
6:     6     1      2       5   TRUE
Starting Balance was 3

## Starting with 2: 
   dates price income balance buying
1:     1     5      2       4  FALSE
2:     2     2      2       4   TRUE
3:     3     3      2       3   TRUE
4:     4     5      2       0   TRUE
5:     5     2      2       0   TRUE
6:     6     1      2       1   TRUE
Starting Balance was 2


# I modified your original data slightly, for testing
df <- rbind(df, df)
df$dates <- seq_along(df$dates)
df[["price"]][[3]] <- 3
answered 2013-10-27T20:37:21.873

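For problems that are easy to express as a loop, I'm increasingly convinced that Rcpp is the right solution. It is relatively easy to pick up, and you can express loop-based algorithms very naturally.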

Here is a solution to your problem using Rcpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List purchaseWhenPossible(NumericVector date, NumericVector income, 
                          NumericVector price, double init_balance = 0) {  
  int n = date.length();
  NumericVector balance(n);
  LogicalVector buy(n);

  for (int i = 0; i < n; ++i) {
    // Carry the previous balance forward and add today's income
    balance[i] = ((i == 0) ? init_balance : balance[i - 1]) + income[i];

    // Buy it if you can afford it
    if (balance[i] >= price[i]) {
      buy[i] = true;
      balance[i] -= price[i];
    } else {
      buy[i] = false;
    }

  }

  return List::create(_["buy"] = buy, _["balance"] = balance);
}

/*** R

# Copying input data from Ricardo
df <- data.frame(
  dates = 1:6,
  income = rep(2, 6),
  price = c(5, 2, 3, 5, 2, 1)
)

out <- purchaseWhenPossible(df$dates, df$income, df$price, 3)
df$balance <- out$balance
df$buy <- out$buy

*/

To run it, save it to a file called purchase.cpp and then run Rcpp::sourceCpp("purchase.cpp").

It should be very fast, since the loop runs as compiled C++, but I haven't done any formal benchmarks.
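
Note that the function above buys only when the full price is affordable, whereas the loop in the question spends min(balance, price) each day. Here is a minimal sketch of that variant in plain C++ (no Rcpp, so it can be compiled and checked on its own). The name spendPerDay and its parameters are hypothetical, chosen for illustration, and it assumes income and price have one entry per day:

```cpp
#include <algorithm>  // std::min
#include <cassert>    // assert
#include <cstddef>    // std::size_t
#include <vector>

// Hypothetical helper (not from the original answers): applies the
// question's rule, spending min(balance, price) each day.
std::vector<double> spendPerDay(const std::vector<double>& income,
                                const std::vector<double>& price,
                                double init_balance = 0.0) {
  // Assumption: one income value and one price per day
  assert(income.size() == price.size());

  std::vector<double> spent(income.size());
  double balance = init_balance;

  for (std::size_t i = 0; i < income.size(); ++i) {
    balance += income[i];                    // receive the day's income
    spent[i] = std::min(balance, price[i]);  // spend what is affordable
    balance -= spent[i];                     // carry the remainder forward
  }
  return spent;
}
```

With the question's data (income 2, 2, 2 and prices 5, 2, 1) and a starting balance of 0, this gives 2, 2, 1. To call it from R, the same loop body could go into an Rcpp function that takes NumericVector arguments in place of std::vector.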

answered 2013-10-28T15:03:27.510