
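I have a long-format data frame with 150,000 rows in which the same id variable occurs many times. I am using reshape (from stats, not package=reshape(2)) to convert it to wide format. To do that, I generate a variable that counts each occurrence of a given level of id, to use as an index.

I have this working with plyr on a small data frame, but it is far too slow for my full df. Can I program this more efficiently?

Because I have about 30 other variables, I have struggled to do this with the reshape package. For each individual analysis it would probably be best to reshape only the variables I am looking at (rather than the whole df).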

> # u=id variable with three value variables 
> u<-c(rep("a",4), rep("b", 3),rep("c", 6), rep("d", 5))
> u<-factor(u)
> v<-1:18
> w<-20:37
> x<-40:57
> df<-data.frame(u,v,w,x)
> df
   u  v  w  x
1  a  1 20 40
2  a  2 21 41
3  a  3 22 42
4  a  4 23 43
5  b  5 24 44
6  b  6 25 45
7  b  7 26 46
8  c  8 27 47
9  c  9 28 48
10 c 10 29 49
11 c 11 30 50
12 c 12 31 51
13 c 13 32 52
14 d 14 33 53
15 d 15 34 54
16 d 16 35 55
17 d 17 36 56
18 d 18 37 57
> 
> library(plyr)
> df2<-ddply(df, .(u), transform, count=rank(u, ties.method="first")) 
> df2
   u  v  w  x count
1  a  1 20 40     1
2  a  2 21 41     2
3  a  3 22 42     3
4  a  4 23 43     4
5  b  5 24 44     1
6  b  6 25 45     2
7  b  7 26 46     3
8  c  8 27 47     1
9  c  9 28 48     2
10 c 10 29 49     3
11 c 11 30 50     4
12 c 12 31 51     5
13 c 13 32 52     6
14 d 14 33 53     1
15 d 15 34 54     2
16 d 16 35 55     3
17 d 17 36 56     4
18 d 18 37 57     5
> reshape(df2, idvar="u", timevar="count", direction="wide")
   u v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
1  a   1  20  40   2  21  41   3  22  42   4  23  43  NA  NA  NA  NA  NA  NA
5  b   5  24  44   6  25  45   7  26  46  NA  NA  NA  NA  NA  NA  NA  NA  NA
8  c   8  27  47   9  28  48  10  29  49  11  30  50  12  31  51  13  32  52
14 d  14  33  53  15  34  54  16  35  55  17  36  56  18  37  57  NA  NA  NA

2 Answers


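I still can't quite figure out why you would ultimately want to convert your dataset from long to wide, because to me the wide result seems like an extremely unwieldy dataset to work with.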

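If you are looking to speed up the enumeration of your factor levels, you can consider ave() in base R, or .N from the "data.table" package. Given that you are working with a lot of rows, you may want to prefer the latter.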
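As a minimal base R sketch of the ave() approach on the question's small example: the names `df_small` and `wide` are made up here for illustration and are not part of the original post. It builds the per-id counter and then passes it to reshape() exactly as the question does.

```r
# Small example data mirroring the question (u is the id variable).
df_small <- data.frame(
  u = factor(c(rep("a", 4), rep("b", 3), rep("c", 6), rep("d", 5))),
  v = 1:18, w = 20:37, x = 40:57
)

# ave() applies FUN within each level of u and returns the results in the
# original row order, so the counter lines up with the unsorted rows.
df_small$count <- ave(seq_along(df_small$u), df_small$u, FUN = seq_along)

# Base reshape() then spreads v, w and x across the counter.
wide <- reshape(df_small, idvar = "u", timevar = "count", direction = "wide")
```

Because ave() keeps the input order, this replaces only the ddply() step; the reshape() call itself is unchanged.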

First, let's make up some data:

set.seed(1)
df <- data.frame(u = sample(letters[1:6], 150000, replace = TRUE),
                 v = runif(150000, 0, 10),
                 w = runif(150000, 0, 100),
                 x = runif(150000, 0, 1000))
list(head(df), tail(df))
# [[1]]
#   u        v        w        x
# 1 b 6.368412 10.52822 223.6556
# 2 c 6.579344 75.28534 450.7643
# 3 d 6.573822 36.87630 283.3083
# 4 f 9.711164 66.99525 681.0157
# 5 b 5.337487 54.30291 137.0383
# 6 f 9.587560 44.81581 831.4087
# 
# [[2]]
#        u        v        w        x
# 149995 b 4.614894 52.77121 509.0054
# 149996 f 5.104273 87.43799 391.6819
# 149997 f 2.425936 60.06982 160.2324
# 149998 a 1.592130 66.76113 118.4327
# 149999 b 5.157081 36.90400 511.6446
# 150000 a 3.565323 92.33530 252.4982
table(df$u)
# 
#     a     b     c     d     e     f 
# 25332 24691 24993 24975 25114 24895 

Load the packages we need:

library(plyr)
library(data.table)

Create a "data.table" version of our dataset:

DT <- data.table(df, key = "u")
DT # Notice that the data are now automatically sorted
#         u         v         w        x
#      1: a 6.2378578 96.098294 643.2433
#      2: a 5.0322400 46.806132 544.6883
#      3: a 9.6289786 87.915303 334.6726
#      4: a 4.3393403  1.994383 753.0628
#      5: a 6.2300123 72.810359 579.7548
#     ---                               
# 149996: f 0.6268414 15.608049 669.3838
# 149997: f 2.3588955 40.380824 658.8667
# 149998: f 1.6383619 77.210309 250.7117
# 149999: f 5.1042725 87.437989 391.6819
# 150000: f 2.4259363 60.069820 160.2324
DT[, .N, by = key(DT)] # Like "table"
#    u     N
# 1: a 25332
# 2: b 24691
# 3: c 24993
# 4: d 24975
# 5: e 25114
# 6: f 24895

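Now let's run some basic tests. The results from ave() are not sorted, whereas those from "data.table" and "plyr" are, so we should also time the sorting step when using ave():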

system.time(AVE <- within(df, {
  count <- ave(as.numeric(u), u, FUN = seq_along)
}))
#    user  system elapsed 
#   0.024   0.000   0.027 

# Now time the sorting
system.time(AVE2 <- AVE[order(AVE$u, AVE$count), ])
#    user  system elapsed 
#   0.264   0.000   0.262 

system.time(DDPLY <- ddply(df, .(u), transform, 
                           count=rank(u, ties.method="first")))
#    user  system elapsed 
#   0.944   0.000   0.984 

system.time(DT[, count := 1:.N, by = key(DT)])
#    user  system elapsed 
#   0.008   0.000   0.004 

all(DDPLY == AVE2)
# [1] TRUE
all(data.frame(DT) == AVE2)
# [1] TRUE

The "data.table" syntax is certainly compact, and it is extremely fast!
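To connect this back to the reshaping step in the question, here is a hedged sketch that stays in "data.table" for the long-to-wide conversion as well. It assumes a "data.table" version whose dcast() accepts several value.var columns; `DT_small` and `wide_dt` are illustrative names, not from the original post.

```r
library(data.table)

# Small example data mirroring the question (u is the id variable).
DT_small <- data.table(
  u = factor(c(rep("a", 4), rep("b", 3), rep("c", 6), rep("d", 5))),
  v = 1:18, w = 20:37, x = 40:57
)

# Number the occurrences of each id by reference (no copy of the table).
DT_small[, count := seq_len(.N), by = u]

# Cast to wide: one row per id, one column per (value variable, counter) pair.
# Ids with fewer occurrences than the largest group are padded with NA.
wide_dt <- dcast(DT_small, u ~ count, value.var = c("v", "w", "x"))
```

Restricting value.var to the variables needed for a given analysis keeps the wide table small, which fits the question's wish to reshape only what is being looked at.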

answered 2013-01-24T16:10:46.753

Using base R to create an empty matrix and then fill it in appropriately can often be significantly faster. In the code below I suspect the slow part would be converting the data frame to a matrix and transposing, as in the first two lines; if so, that could perhaps be avoided if it could be stored differently to start with.

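g <- df$u                     # id variable (the question's small df has columns u, v, w, x)
x <- t(as.matrix(df[,-1]))    # value columns, transposed: one row per variable

k <- split(seq_along(g), g)   # row indices for each level of the id
n <- max(sapply(k, length))   # largest group size = number of column blocks
out <- matrix(ncol=n*nrow(x), nrow=length(k))   # filled with NA by default
for(idx in seq_along(k)) {
  # column-major fill: v, w, x for occurrence 1, then for occurrence 2, ...
  out[idx, seq_len(length(k[[idx]])*nrow(x))] <- x[,k[[idx]]]
}
rownames(out) <- names(k)
colnames(out) <- paste(rep(rownames(x), n), rep(seq_len(n), each=nrow(x)), sep=".")
out
#   v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6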
# a   1  20  40   2  21  41   3  22  42   4  23  43  NA  NA  NA  NA  NA  NA
# b   5  24  44   6  25  45   7  26  46  NA  NA  NA  NA  NA  NA  NA  NA  NA
# c   8  27  47   9  28  48  10  29  49  11  30  50  12  31  51  13  32  52
# d  14  33  53  15  34  54  16  35  55  17  36  56  18  37  57  NA  NA  NA
answered 2013-01-23T15:06:35.897