r - R data.table efficient replication by group

Question

I am running into some memory allocation problems trying to replicate some data by groups using data.table and rep.

Here is some sample data:

ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")

What I am trying to do is replicate the tokens (in the present order) by the multiple for each year.

The following code works and gives me the answer I want:

library(plyr)        # provides ddply
library(data.table)

good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))

good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]

The issue is that when I scale this up to 40 million+ rows, I run into memory issues with both solutions.

If my understanding is correct, these solutions are essentially doing an rbind, which allocates every time.

Does anyone have a better solution?

I looked at set() for data.table but was running into issues because I wanted to keep the tokens in the same order for each replication.

score 3 · Accepted Answer

One way is:

require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]

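Everything except the last line should be straightforward. The last line subsets on the key column with the help of J(.). Each value in J(.) is matched against the key column, and the matching subset is returned.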

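That is, if you do dt[J(1)] you get the subset where multiple = 1. If you look closely, doing dt[J(rep(1,2))] gives you the same subset, but twice. Note that there is a difference between passing dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter subsets by matching (1 and 1) against the first key column of the data.table.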
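To make the difference concrete, here is a minimal sketch on a toy table. The data and the names `once` and `twice` are made up for illustration; `allow.cartesian = TRUE` is passed so the join is permitted regardless of result size.

```r
library(data.table)

# Toy table (hypothetical data, not the question's)
dt <- data.table(multiple = c(1, 1, 2), token = c("A", "B", "C"))
setkey(dt, multiple)

# One lookup value: the rows where multiple == 1
once <- dt[J(1)]

# The same lookup value twice: the same rows are returned once per lookup value
twice <- dt[J(c(1, 1)), allow.cartesian = TRUE]

nrow(once)
nrow(twice)
twice$token   # the matching rows, in key order, for each lookup value in turn
```

Each lookup value returns its matching rows in key order, so the tokens keep their original order within every replication.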

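So, if we pass the same column value twice in J(.), the matching rows get replicated twice. We use this trick to pass 1 one time, 2 two times, and so on. That is what the rep(.) part does: rep(.) gives 1,2,2,3,3,3,4,4,4,4.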

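If the join results in more rows than max(nrow(dt), nrow(i)) (i is the rep vector inside J(.)), then you have to explicitly use allow.cartesian = TRUE to perform this join (I guess this is a new feature from data.table 1.8.8).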
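Putting the rep(.) trick and allow.cartesian together, a small sketch on stand-in data might look like this. The values below are hypothetical and much smaller than the question's data; `lookup` and `res` are names chosen for the example.

```r
library(data.table)

# Small stand-in for the question's data (hypothetical values)
dt <- data.table(yr       = c(1999, 1999, 2000, 2000, 2000),
                 token    = c("THE", "DOG", "I", "WALKED", "HOME"),
                 multiple = c(3, 3, 2, 2, 2))
setkey(dt, multiple)

idx    <- unique(dt$multiple)   # distinct multiples, in key order
lookup <- rep(idx, idx)         # each multiple m repeated m times

# The result has more rows than either input, so allow.cartesian is needed
res <- dt[J(lookup), allow.cartesian = TRUE]

nrow(res)
res[yr == 2000]$token
```

Here the 2000 tokens come back twice and the 1999 tokens three times, each time in their original order.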

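Edit: Here is some benchmarking I did on "relatively" big data. I don't see any spike in memory allocation with either method. But I have yet to find a way to monitor peak memory usage within a function in R. I'm sure I've seen such a post here on SO, but it escapes me at the moment. I'll write back. For now, here is some test data and some preliminary results, in case anyone is interested or wants to run it themselves.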
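One rough way to look at peak memory, sketched here as an assumption rather than something used for the results in this answer, is base R's gc(reset = TRUE), which resets the "max used" counters that a later gc() call reports. The helper name `peak_mb` is hypothetical, and the figure is only approximate because the maxima are recorded at garbage collections.

```r
# Hypothetical helper: approximate extra memory (in Mb) reached while evaluating `expr`
peak_mb <- function(expr) {
  before <- gc(reset = TRUE)                # reset the "max used" counters
  baseline <- sum(before[, 2])              # Mb in use before evaluation (Ncells + Vcells)
  force(expr)                               # evaluate the lazily passed expression
  after <- gc()
  mb_col <- which(colnames(after) == "max used") + 1  # the "(Mb)" column after "max used"
  sum(after[, mb_col]) - baseline
}

# Example: allocating ten million doubles (about 76 Mb)
p <- peak_mb(numeric(1e7))
p
```

With the functions from this answer, a call such as peak_mb(ARUN.DT(DT)) could be compared against peak_mb(RICARDO.DT(DT)).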

# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=T), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)

DF <- data.frame(yr = rep(yr, sz), 
                 token = token, 
                 multiple = multiple, stringsAsFactors=FALSE)

# Arun's solution
ARUN.DT <- function(dt) {
    setkey(dt, "multiple")
    idx <- unique(dt$multiple)
    dt[J(rep(idx,idx)), allow.cartesian=TRUE]
}

# Ricardo's solution
RICARDO.DT <- function(dt) {
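    setkey(dt, "yr")
    # allocate one row per replicated token, keeping only the yr column
    newDT <- setkey(dt[, rep(NA, length(token) * unique(multiple)), by=yr][, list(yr)], 'yr')
    newDT[, tokenReps := as.character(NA)]

    # Add the rep'd tokens into newDT; rep() makes the recycling explicit,
    # since recent data.table versions do not recycle a shorter RHS in `:=`
    newDT[, tokenReps := rep(dt[.(y)][, token], length.out=.N), by=list(y=yr)]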
    newDT
}

# create data.table
require(data.table)
DT <- data.table(DF)

# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")

#                     test replications elapsed relative user.self sys.self
# 1    res1 <- ARUN.DT(DT)           10   9.542    1.000     7.218    1.394
# 2 res2 <- RICARDO.DT(DT)           10  17.484    1.832    14.270    2.888

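But as Ricardo points out, the speed difference may not matter if you run out of memory. So, in that case, there has to be a trade-off between speed and memory. What I would like to verify is the peak memory used by the two methods here, to say definitively whether using a join is better.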

score 1 · Accepted Answer

You can try allocating the memory for all the rows first, and then populating them iteratively.
For example:

  require(data.table)

  # make sure `sample_data$multiple` is an integer
  sample_data$multiple <- as.integer(sample_data$multiple)

  # create data.table
  S <- data.table(sample_data, key='yr')

  # optionally, drop original data.frame if not needed
  rm(sample_data)

  ## Allocate the memory first (use S here, since sample_data may have been removed)
  newDT <- data.table(yr = rep(S$yr, S$multiple), key="yr")
  newDT[, tokenReps := as.character(NA)]

  # Add the rep'd tokens into newDT; rep() makes the recycling explicit,
  # since recent data.table versions do not recycle a shorter RHS in `:=`
  newDT[, tokenReps := rep(S[.(y)][, token], length.out=.N), by=list(y=yr)]

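Two notes:

(1) sample_data$multiple is currently a character and is therefore coerced when passed to rep (in your original example). It may be worth double-checking your real data in case the same is true there.

(2) I used the following to determine the number of rows needed per year: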

S[, list(rows=length(token) * unique(multiple)), by=yr]
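Both notes can be checked on a cut-down version of the question's first observation. This is a sketch with only three tokens; `ob` and `rows_needed` are names chosen for the example.

```r
library(data.table)

# cbind() on mixed numbers and strings yields a character matrix,
# which is why every column built this way ends up as character
ob <- as.data.frame(cbind(c(1999), c("THE", "BLACK", "DOG"), c(4)),
                    stringsAsFactors = FALSE)
colnames(ob) <- c("yr", "token", "multiple")
class(ob$multiple)

# Convert once, up front, instead of relying on coercion inside rep()
ob$multiple <- as.integer(ob$multiple)
S <- data.table(ob, key = "yr")

# Rows needed per year: number of tokens times that year's multiple
rows_needed <- S[, list(rows = length(token) * unique(multiple)), by = yr]
rows_needed$rows   # 3 tokens * multiple of 4
```

Summing the rows column gives the total size of the table to allocate.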
