I am working with a simple tall table of the form

date         variable   value
1970-01-01   V1         0.434
1970-01-01   V2         12.12
1970-01-01   V3         921.1
1970-01-02   V1         -1.10
1970-01-03   V3         0.000
1970-01-03   V5         312e6
...          ...        ...

The (date, variable) pairs are unique. I would like to reshape this table into wide format:

date         V1         V2         V3         V4         V5        
1970-01-01   0.434      12.12      921.1      NA         NA
1970-01-02   -1.10      NA         NA         NA         NA
1970-01-03   NA         NA         0.000      NA         312e6

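I want to do this as fast as possible, because I have to repeat the operation many times on tables with 1e6 records. In native R, I believe tapply(), reshape() and d*ply() are all outperformed by data.table. I would like to test the performance of the latter against an sqlite-based solution (or another database). Has this been done before? Is there a performance gain? Also, how does one convert tall to wide in sqlite when the number of "wide" fields (the date) is variable and not known in advance?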

2 Answers

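The approach I use is based on what tapply does, but it is about an order of magnitude faster (mainly because there is no function call per cell).

Timings, using tall from Prasad's post: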

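pivot = function(col, row, value) {
  # Factor levels define the column and row labels of the result
  col = as.factor(col)
  row = as.factor(row)
  # Matrix pre-filled with NA, one cell per (row, col) combination
  mat = array(dim = c(nlevels(row), nlevels(col)), dimnames = list(levels(row), levels(col)))
  # Column-major linear index of each record: (col - 1) * nrow + row
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] = value
  mat
}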

> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
  11.31    0.03   11.36 

> system.time( replicate(100, wide <- with(tall, pivot(tkr, dt, value))))
   user  system elapsed 
    0.9     0.0     0.9 

As for possible problems with row ordering, there should not be any:

> a <- with(tall, pivot(tkr, dt, value))
> b <- with(tall[sample(nrow(tall)), ], pivot(tkr, dt, value))
> all.equal(a, b)
[1] TRUE
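
To make the indexing trick concrete, here is a minimal sketch that applies the pivot function above to the six sample rows from the question. The name tall_small is an assumption made for this illustration; it does not appear in the original posts.

```r
# Sample rows from the question, in tall format (tall_small is a made-up name).
tall_small <- data.frame(
  date     = c("1970-01-01", "1970-01-01", "1970-01-01",
               "1970-01-02", "1970-01-03", "1970-01-03"),
  variable = c("V1", "V2", "V3", "V1", "V3", "V5"),
  value    = c(0.434, 12.12, 921.1, -1.10, 0.000, 312e6)
)

# Same function as in the answer, repeated so the block runs on its own.
pivot <- function(col, row, value) {
  col <- as.factor(col)
  row <- as.factor(row)
  mat <- array(dim = c(nlevels(row), nlevels(col)),
               dimnames = list(levels(row), levels(col)))
  # R matrices are stored column by column, so the cell (row, col)
  # lives at linear position (col - 1) * nrow + row.
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] <- value
  mat
}

wide_small <- with(tall_small, pivot(variable, date, value))
print(wide_small)
```

Only keys that occur in the data become columns, so this sample yields V1, V2, V3 and V5 but no V4. Cells with no matching record stay NA.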
answered 2011-03-15T13:51:01.290

A few remarks. A couple of SO questions address how to do tall-to-wide pivoting in SQL(ite): here and here. I haven't looked at those too deeply, but my impression is that doing it in SQL is ugly, as in: your SQL query needs to explicitly mention all possible keys! (Someone please correct me if I'm wrong.) As for data.table, you can definitely do group-wise operations very fast, but I don't see how you can actually cast the result into a wide format.
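
One possible workaround for the "mention every key" problem is to generate the query text from the keys found in the data. The sketch below is only an illustration under stated assumptions: build_pivot_sql and tall_small are hypothetical names, the table and column names (tall, date, variable, value) follow the question, and the keys are assumed to be plain identifiers because they are pasted into the SQL text. The query is executed only if the DBI and RSQLite packages happen to be installed.

```r
# Sample rows from the question (tall_small is a made-up name).
tall_small <- data.frame(
  date     = c("1970-01-01", "1970-01-01", "1970-01-01",
               "1970-01-02", "1970-01-03", "1970-01-03"),
  variable = c("V1", "V2", "V3", "V1", "V3", "V5"),
  value    = c(0.434, 12.12, 921.1, -1.10, 0.000, 312e6)
)

# Hypothetical helper: one MAX(CASE ...) column per distinct key.
build_pivot_sql <- function(keys, table_name = "tall") {
  keys <- sort(unique(as.character(keys)))
  # Keys end up inside the SQL string, so accept plain identifiers only.
  stopifnot(all(grepl("^[A-Za-z][A-Za-z0-9_]*$", keys)))
  cols <- sprintf("MAX(CASE WHEN variable = '%s' THEN value END) AS %s",
                  keys, keys)
  paste0("SELECT date, ", paste(cols, collapse = ", "),
         " FROM ", table_name, " GROUP BY date ORDER BY date")
}

sql <- build_pivot_sql(tall_small$variable)
cat(sql, "\n")

# Optional: run the generated query against an in-memory SQLite database.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "tall", tall_small)
  wide_sql <- DBI::dbGetQuery(con, sql)
  DBI::dbDisconnect(con)
  print(wide_sql)
}
```

The query still names every key; the difference is that R writes the list instead of a person. Whether this beats an in-memory pivot for 1e6 records is not something this sketch measures.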

If you want to do it purely in R, I think tapply is the speed champ here, much faster than acast from reshape2:

Create some tall data, with some holes in it just to make sure the code is doing the right thing:

tall <- data.frame( dt = rep(1:100, 100),
                     tkr = rep( paste('v',1:100,sep=''), each = 100),
                     value = rnorm(1e4)) [-(1:5), ]


> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
   4.73    0.00    4.73 

> system.time( replicate(100, wide <- acast( tall, tkr ~ dt)))
   user  system elapsed 
   7.93    0.03    7.98 
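
For reference, a small sketch of the same tapply call on the six sample rows from the question. The name tall_small is an assumption for this illustration, and so is the premise that the full set of variables V1 to V5 is known up front; declaring them as factor levels is what makes the empty V4 column appear.

```r
# Sample rows from the question (tall_small is a made-up name).
tall_small <- data.frame(
  date     = c("1970-01-01", "1970-01-01", "1970-01-01",
               "1970-01-02", "1970-01-03", "1970-01-03"),
  variable = c("V1", "V2", "V3", "V1", "V3", "V5"),
  value    = c(0.434, 12.12, 921.1, -1.10, 0.000, 312e6)
)

# Assumption: all five variables are known in advance, so V4 is declared
# as a level even though no record carries it.
tall_small$variable <- factor(tall_small$variable, levels = paste0("V", 1:5))

# Each (date, variable) pair is unique, so identity just passes the value through.
wide_small <- with(tall_small, tapply(value, list(date, variable), identity))
print(wide_small)
```

tapply fills the cells that have no record with NA, which covers the whole V4 column here.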
answered 2011-03-15T13:37:50.757