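I have a large set of historical guest appointment records (> 1 million entries). Each row records the guest's ID, the date of the appointment, and the status of the appointment (1 means the guest showed up, 0 means a no-show). A sample is shown below (test).

For each guest, I need to compute the number of appointments made before the current appointment. I have sorted the data by date in ascending order.

I tried to do the calculation with data.table(). A sample of the result is shown below as testWithVisit. The method I use works for small data tables, but it is slow for tables with more than 10,000 rows, and I cannot finish the calculation for all 1 million rows. Does anyone have an elegant solution for this? Thanks in advance.

The code I used to generate the test data and compute testWithVisit is at the bottom.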

> test
      ID                Date Status Index
 1: 1002 2012-01-11 03:46:27      1     1
 2: 1001 2012-02-17 10:15:59      1     2
 3: 1002 2012-02-26 13:18:42      1     3
 4: 1001 2012-02-27 18:48:00      1     4
 5: 1004 2012-03-11 05:40:36      1     5
 6: 1004 2012-03-17 06:06:05      0     6
 7: 1008 2012-03-17 14:41:53      0     7
 8: 1008 2012-03-21 13:55:51      1     8
 9: 1008 2012-03-22 22:30:42      0     9
10: 1005 2012-03-29 09:00:39      1    10
11: 1005 2012-04-04 02:46:54      1    11
12: 1004 2012-04-05 22:53:05      1    12
13: 1006 2012-04-11 19:53:10      0    13
14: 1007 2012-04-14 17:19:07      1    14
15: 1003 2012-04-16 08:28:26      1    15
16: 1007 2012-04-16 19:26:57      1    16
17: 1001 2012-04-17 15:43:26      1    17
18: 1008 2012-04-21 07:12:20      0    18
19: 1004 2012-04-26 06:44:01      0    19
20: 1001 2012-05-10 13:17:56      1    20
21: 1005 2012-05-10 18:56:17      1    21
22: 1008 2012-05-11 08:58:28      1    22
23: 1001 2012-05-16 08:20:22      1    23
24: 1003 2012-06-06 04:15:58      1    24
25: 1006 2012-06-11 12:01:15      1    25
26: 1008 2012-06-20 14:06:22      1    26
27: 1002 2012-06-21 05:18:20      1    27
28: 1008 2012-06-29 16:07:28      0    28
29: 1002 2012-07-02 09:42:15      1    29
30: 1005 2012-07-06 22:45:24      1    30
31: 1007 2012-07-08 01:51:51      1    31
32: 1001 2012-08-12 07:04:49      1    32
33: 1006 2012-08-29 04:09:09      1    33
34: 1006 2012-09-25 19:37:58      0    34
35: 1003 2012-10-07 06:20:29      0    35
36: 1002 2012-10-08 19:16:35      0    36
37: 1001 2012-10-11 07:38:40      0    37
38: 1001 2012-10-24 10:58:16      0    38
39: 1005 2012-10-28 16:28:39      0    39
40: 1008 2012-10-30 01:57:52      1    40
41: 1006 2012-11-04 09:14:35      1    41
42: 1007 2012-11-11 10:56:59      0    42
43: 1008 2012-11-13 17:05:58      0    43
44: 1001 2012-11-17 08:38:36      1    44
45: 1005 2012-11-26 02:49:51      1    45
46: 1008 2012-11-26 06:12:53      0    46
47: 1005 2012-11-29 17:34:43      1    47
48: 1001 2012-11-29 23:25:36      0    48
49: 1006 2012-12-14 17:35:57      0    49
50: 1002 2012-12-19 08:36:07      1    50
      ID                Date Status Index
> testWithVisit
    Index   ID                Date Status Num_Visit Num_Show
 1:     1 1002 2012-01-11 03:46:27      1         0        0
 2:     2 1001 2012-02-17 10:15:59      1         0        0
 3:     3 1002 2012-02-26 13:18:42      1         1        1
 4:     4 1001 2012-02-27 18:48:00      1         1        1
 5:     5 1004 2012-03-11 05:40:36      1         0        0
 6:     6 1004 2012-03-17 06:06:05      0         1        1
 7:     7 1008 2012-03-17 14:41:53      0         0        0
 8:     8 1008 2012-03-21 13:55:51      1         1        0
 9:     9 1008 2012-03-22 22:30:42      0         2        1
10:    10 1005 2012-03-29 09:00:39      1         0        0
11:    11 1005 2012-04-04 02:46:54      1         1        1
12:    12 1004 2012-04-05 22:53:05      1         2        1
13:    13 1006 2012-04-11 19:53:10      0         0        0
14:    14 1007 2012-04-14 17:19:07      1         0        0
15:    15 1003 2012-04-16 08:28:26      1         0        0
16:    16 1007 2012-04-16 19:26:57      1         1        1
17:    17 1001 2012-04-17 15:43:26      1         2        2
18:    18 1008 2012-04-21 07:12:20      0         3        1
19:    19 1004 2012-04-26 06:44:01      0         3        2
20:    20 1001 2012-05-10 13:17:56      1         3        3
21:    21 1005 2012-05-10 18:56:17      1         2        2
22:    22 1008 2012-05-11 08:58:28      1         4        1
23:    23 1001 2012-05-16 08:20:22      1         4        4
24:    24 1003 2012-06-06 04:15:58      1         1        1
25:    25 1006 2012-06-11 12:01:15      1         1        0
26:    26 1008 2012-06-20 14:06:22      1         5        2
27:    27 1002 2012-06-21 05:18:20      1         2        2
28:    28 1008 2012-06-29 16:07:28      0         6        3
29:    29 1002 2012-07-02 09:42:15      1         3        3
30:    30 1005 2012-07-06 22:45:24      1         3        3
31:    31 1007 2012-07-08 01:51:51      1         2        2
32:    32 1001 2012-08-12 07:04:49      1         5        5
33:    33 1006 2012-08-29 04:09:09      1         2        1
34:    34 1006 2012-09-25 19:37:58      0         3        2
35:    35 1003 2012-10-07 06:20:29      0         2        2
36:    36 1002 2012-10-08 19:16:35      0         4        4
37:    37 1001 2012-10-11 07:38:40      0         6        6
38:    38 1001 2012-10-24 10:58:16      0         7        6
39:    39 1005 2012-10-28 16:28:39      0         4        4
40:    40 1008 2012-10-30 01:57:52      1         7        3
41:    41 1006 2012-11-04 09:14:35      1         4        2
42:    42 1007 2012-11-11 10:56:59      0         3        3
43:    43 1008 2012-11-13 17:05:58      0         8        4
44:    44 1001 2012-11-17 08:38:36      1         8        6
45:    45 1005 2012-11-26 02:49:51      1         5        4
46:    46 1008 2012-11-26 06:12:53      0         9        4
47:    47 1005 2012-11-29 17:34:43      1         6        5
48:    48 1001 2012-11-29 23:25:36      0         9        7
49:    49 1006 2012-12-14 17:35:57      0         5        3
50:    50 1002 2012-12-19 08:36:07      1         5        4
    Index   ID                Date Status Num_Visit Num_Show

library(data.table)

# Generate test data.
test = data.frame(list(ID = sample(1001:1008, size = 50, replace = TRUE)))
test$Date = as.POSIXct(sample(as.POSIXct("2012-01-01"):as.POSIXct("2012-12-31"), size = 50, replace = FALSE), origin = as.POSIXct("1970-01-01"))
test$Status = sample(0:1, size = nrow(test), replace = TRUE, prob = c(0.4, 0.6))
test = test[order(test$Date), ]
test$Index = c(1:nrow(test))

# Compute Num_Visit and Num_Show (one group per row, each scanning the whole table)
test = data.table(test)
setkey(test, "Index")
counts = test[, list(Num_Visit = length(Index[test$Index < Index & test$ID == ID]),
                       Num_Show = length(Index[test$Index < Index & test$ID == ID 
                                               & test$Status == 1])), 
           by = key(test)]

# Join the counts back onto the original rows by Index
testWithVisit = test[counts, ]
1 Answer

This is slow because of the regular data.frame-style subsetting you perform in the definition of counts. For example:

Num_Visit = length(Index[test$Index < Index & test$ID == ID])

This also (I think...) requires a copy of the data, which becomes hard on memory as the data grows. Instead, you should use data.table operations and cumsum.

test[, list(Date = Date,
            Status = Status,
            Num_Visit = seq_along(.I) - 1,
            # subtract the current Status so only earlier shows are counted
            Num_Show  = cumsum(Status) - Status),
     by=ID]
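
As a sanity check, the grouped running-count idea can be compared against a brute-force count on a tiny hand-made table. This is only a sketch under stated assumptions: the table `small` and the helper `brute_force_counts` are made up for illustration and are not part of the original question or answer. It adds the columns in place with `:=` rather than building a new table, and it computes `Num_Show` as `cumsum(Status) - Status` so that a no-show row does not end up at -1.

```r
library(data.table)

# Hypothetical toy data, already sorted by Date (illustration only).
small <- data.table(
  ID     = c(1L, 2L, 1L, 1L, 2L),
  Date   = as.POSIXct("2012-01-01", tz = "UTC") + 86400 * 0:4,
  Status = c(0L, 1L, 1L, 0L, 1L)
)

# Add both counters by reference, one group per guest.
small[, `:=`(
  Num_Visit = seq_len(.N) - 1L,        # appointments before this one
  Num_Show  = cumsum(Status) - Status  # shows before this one
), by = ID]

# Hypothetical brute-force helper: O(n^2), only for checking small inputs.
brute_force_counts <- function(dt) {
  n <- nrow(dt)
  visits <- integer(n)
  shows  <- integer(n)
  for (i in seq_len(n)) {
    # earlier appointments of the same guest
    prior <- dt$ID == dt$ID[i] & dt$Date < dt$Date[i]
    visits[i] <- sum(prior)
    shows[i]  <- sum(dt$Status[prior])
  }
  list(Num_Visit = visits, Num_Show = shows)
}

# Both approaches should agree on every row.
check <- brute_force_counts(small)
stopifnot(all(small$Num_Visit == check$Num_Visit),
          all(small$Num_Show  == check$Num_Show))
```

The grouped version makes a single pass over each guest's rows, whereas the brute-force helper rescans the whole table for every row, which is essentially what makes the approach in the question slow.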

You can also use a key to sort the data, so that the dates are in the correct order within each ID.

setkey(test, ID, Date)
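
If the rows do not arrive in date order, keying first and then computing the counters might look like the sketch below. The table `unsorted` is hypothetical and exists only for illustration; the counters are written with `:=` and with `cumsum(Status) - Status` for the prior-show count, which is one possible formulation rather than the only one.

```r
library(data.table)

# Hypothetical toy data in arbitrary row order (illustration only).
unsorted <- data.table(
  ID     = c(2L, 1L, 1L, 2L),
  Date   = as.POSIXct(c("2012-03-01", "2012-02-01",
                        "2012-01-01", "2012-01-15"), tz = "UTC"),
  Status = c(1L, 0L, 1L, 0L)
)

# setkey() reorders the table by reference: by ID, then by Date within ID.
setkey(unsorted, ID, Date)

# With rows in date order inside each group, running counts are safe.
unsorted[, `:=`(
  Num_Visit = seq_len(.N) - 1L,        # appointments before this one
  Num_Show  = cumsum(Status) - Status  # shows before this one
), by = ID]

print(unsorted)
```

Note that `setkey` changes the physical row order, so the result is grouped by ID rather than in overall chronological order. If the original order matters, a column such as `Index` from the question can be used to sort back afterwards.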
answered 2013-02-26T16:01:20.937