r - Pivoting a large data.table

Question

I have a large data.table in R:

library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
  ID=sample(1:200000, n, replace=TRUE), 
  Month=sample(1:12, n, replace=TRUE),
  Category=sample(1:1000, n, replace=TRUE),
  Qty=runif(n)*500,
  key=c('ID', 'Month')
)
dim(DT)

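I would like to pivot this data.table so that each Category becomes its own column. Unfortunately, since the number of categories is not constant within groups, I can't use this answer.

Any ideas how I might do this?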

/edit: Based on joran's comments and flodel's answer, what we are really reshaping is the following data.table:

agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]

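This reshape can be done in a number of ways (I have received some good answers so far), but what I am really looking for is something that scales well to a data.table with millions of rows and hundreds to thousands of categories.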

score 9 · Accepted Answer

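data.table implements faster versions of melt/dcast as data.table-specific methods (in C). It also adds features for melting and casting multiple columns. See the "Efficient reshaping using data.tables" vignette.

Note that we don't need to load the reshape2 package.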

library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
  ID=sample(1:200000, n, replace=TRUE), 
  Month=sample(1:12, n, replace=TRUE),
  Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
  Qty=runif(n),
  key=c('ID', 'Month')
)
dim(DT)

> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
#   user  system elapsed
# 65.924  20.577  86.987
> dim(ans)
# [1] 2399401     802
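To see what this call produces, here is a minimal sketch on a tiny table. The toy data and the names `small` and `wide` are illustrative and are not part of the benchmark above. The sketch spells out `value.var` and `fun.aggregate` explicitly and sets `fill = 0`, on the assumption that ID/Month/Category combinations that never occur should read as zero.

```r
library(data.table)

# Hypothetical toy data: ID 1 / Month 1 has two rows in category "a"
small <- data.table(
  ID       = c(1, 1, 1, 2, 2),
  Month    = c(1, 1, 2, 1, 1),
  Category = c("a", "a", "b", "a", "c"),
  Qty      = c(10, 5, 7, 3, 4)
)

# One row per ID/Month, one column per Category; duplicates are summed
# and combinations that do not occur are filled with 0
wide <- dcast(small, ID + Month ~ Category,
              value.var = "Qty", fun.aggregate = sum, fill = 0)
wide
```

The two category "a" rows for ID 1, Month 1 collapse into a single cell holding 15, and the result has one column per distinct category in addition to the two identifier columns.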

score 3 · Accepted Answer

Like this?

agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]

reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
        timevar = "Category", direction = "wide")
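As a small illustration of what base `reshape` returns for this kind of call, here is a sketch on an already aggregated toy table. The data and the names `agg_small` and `wide_small` are made up for the example, and a plain data.frame is used so the sketch depends on base R only.

```r
# Hypothetical pre-aggregated data: one row per ID/Month/Category
agg_small <- data.frame(
  ID       = c(1, 1, 2, 2),
  Month    = c(1, 2, 1, 1),
  Category = c("a", "b", "a", "c"),
  Qty      = c(15, 7, 3, 4)
)

# Wide format: one row per ID/Month, one Qty.<Category> column per category
wide_small <- reshape(agg_small, v.names = "Qty", idvar = c("ID", "Month"),
                      timevar = "Category", direction = "wide")
wide_small
```

The new columns are named `Qty.a`, `Qty.b` and `Qty.c`, and combinations that are absent from the input come back as `NA` rather than 0.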

score 3 · Accepted Answer

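There is no data.table-specific method for reshaping to wide.

Here is an approach that works, but it is rather convoluted.

There is a feature request, #2619 (scoping for the LHS in :=), that would help make this more straightforward.

Here is a simple example: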

# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N,  by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
                     # safe version of correct NA  
                     z <- cc[1]
                      is.na(z) <-1
                     # using rows value calculated previously
                     # to ensure correct size
                       rep(z,rows)}, 
                    simplify = FALSE), 'names', nn),
           keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so should be
# quick (the feature request would make it possible to
# speed this up using binary search joins)
for(ii in ub){
  for(jj in nn){
    DDw[list(ii), (jj) := DDD[list(ii,jj)][['cc']]]
  }
}

DDw
#    b  d e  f
# 1: a  1 2  3
# 2: a  4 2  3
# 3: b NA 5 NA
# 4: b NA 5 NA
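The assignment inside the loop depends on the left-hand side of `:=` being treated as a variable that holds a column name, not as a literal column name. A minimal sketch of that one step, using the parenthesised form `(col) :=` and a keyed subset in `i`; the table `toy` and the variable `col` are illustrative only:

```r
library(data.table)

# Hypothetical keyed table with two rows in group "a" and one in group "b"
toy <- data.table(b = c("a", "a", "b"), key = "b")
col <- "d"   # name of the column to create, held in a variable

# Create the column as NA of the right type, then fill group "a" by reference
toy[, (col) := NA_real_]
toy[list("a"), (col) := c(1, 4)]
toy
```

Wrapping the name in parentheses makes data.table evaluate `col` and create a column called `d`; the rows for group `b` are left as `NA` because only the rows matched by `list("a")` are assigned.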

score 2 · Accepted Answer

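Edit

I found this SO post, which includes a better way to insert the missing rows into a data.table. The function fun_DT has been adjusted accordingly. The code is cleaner now; I don't see any speed improvement, though.

See my update in the other post. Arun's solution works as well, but you have to insert the missing combinations manually. Since there are more identifier columns here (ID, Month), I only came up with a dirty solution (first create an ID2, then create all ID2-Category combinations, then fill up the data.table, then do the reshaping).

I'm pretty sure this isn't the best solution, but if this FR were built in, these steps might be done automatically.

The solutions are roughly the same speed-wise, although it would be interesting to see how they scale (my machine is too slow, so I don't want to increase n any further... the computer has crashed too often already ;-)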

library(data.table)
library(rbenchmark)

fun_reshape <- function(n) {

  DT <- data.table(
    ID=sample(1:100, n, replace=TRUE), 
    Month=sample(1:12, n, replace=TRUE),
    Category=sample(1:10, n, replace=TRUE),
    Qty=runif(n)*500,
    key=c('ID', 'Month')
  )
  agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
  reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
          timevar = "Category", direction = "wide")
}

#UPDATED!
fun_DT <- function(n) {

  DT <- data.table(
    ID=sample(1:100, n, replace=TRUE), 
    Month=sample(1:12, n, replace=TRUE),
    Category=sample(1:10, n, replace=TRUE),
    Qty=runif(n)*500,
    key=c('ID', 'Month')
  ) 

  agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
  agg[, ID2 := paste(ID, Month, sep="_")]

  setkey(agg, ID2, Category)
  agg <- agg[CJ(unique(ID2), unique(Category))]

  agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]

}

n <- 1e+07
benchmark(replications=10,
          fun_reshape(n),
          fun_DT(n))
            test replications elapsed relative user.self sys.self user.child sys.child
2      fun_DT(n)           10  45.868        1    43.154    2.524          0         0
1 fun_reshape(n)           10  45.874        1    42.783    2.896          0         0
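The fill-then-reshape steps are easier to follow on a tiny table. The sketch below uses made-up data and the hypothetical names `agg_small`, `full` and `wide_small`; it also uses base `setNames` where fun_DT uses `setattr`, as a conservative choice for a standalone example.

```r
library(data.table)

# Hypothetical aggregated data with a combined identifier, as in fun_DT
agg_small <- data.table(
  ID2      = c("1_1", "1_2", "2_1", "2_1"),
  Category = c("a", "b", "a", "c"),
  Qty      = c(15, 7, 3, 4)
)
setkey(agg_small, ID2, Category)

# Join against every ID2 x Category combination; combinations that
# never occurred come back as rows with Qty = NA
full <- agg_small[CJ(unique(ID2), unique(Category))]

# Each group now has one value per category, in the same order,
# so the values can be spread into named columns
wide_small <- full[, as.list(setNames(Qty, Category)), by = ID2]
wide_small
```

Because every ID2 group has exactly one row per category after the cross join, the per-group list has the same names and length in every group, which is what lets the grouped result line up into columns.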
