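I have a large set of historical guest appointment records (more than 1 million rows). Each row records the guest's ID, the date of the appointment, and the appointment status (1 = showed up, 0 = did not show up). A sample is shown below (test).
I need to compute, for each guest, the number of appointments made before the current appointment (Num_Visit), along with how many of those earlier appointments the guest showed up for (Num_Show). I have sorted the data by date in ascending order.
I tried doing the computation with data.table. A sample of the result is shown below as testWithVisit. The approach I used works for small tables, but it is slow for tables with more than 10,000 rows, and I cannot finish the computation for all 1 million rows. Does anyone have an elegant solution for this? Thanks in advance.
The code I used to generate the test data and compute testWithVisit is at the bottom.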
> test
ID Date Status Index
1: 1002 2012-01-11 03:46:27 1 1
2: 1001 2012-02-17 10:15:59 1 2
3: 1002 2012-02-26 13:18:42 1 3
4: 1001 2012-02-27 18:48:00 1 4
5: 1004 2012-03-11 05:40:36 1 5
6: 1004 2012-03-17 06:06:05 0 6
7: 1008 2012-03-17 14:41:53 0 7
8: 1008 2012-03-21 13:55:51 1 8
9: 1008 2012-03-22 22:30:42 0 9
10: 1005 2012-03-29 09:00:39 1 10
11: 1005 2012-04-04 02:46:54 1 11
12: 1004 2012-04-05 22:53:05 1 12
13: 1006 2012-04-11 19:53:10 0 13
14: 1007 2012-04-14 17:19:07 1 14
15: 1003 2012-04-16 08:28:26 1 15
16: 1007 2012-04-16 19:26:57 1 16
17: 1001 2012-04-17 15:43:26 1 17
18: 1008 2012-04-21 07:12:20 0 18
19: 1004 2012-04-26 06:44:01 0 19
20: 1001 2012-05-10 13:17:56 1 20
21: 1005 2012-05-10 18:56:17 1 21
22: 1008 2012-05-11 08:58:28 1 22
23: 1001 2012-05-16 08:20:22 1 23
24: 1003 2012-06-06 04:15:58 1 24
25: 1006 2012-06-11 12:01:15 1 25
26: 1008 2012-06-20 14:06:22 1 26
27: 1002 2012-06-21 05:18:20 1 27
28: 1008 2012-06-29 16:07:28 0 28
29: 1002 2012-07-02 09:42:15 1 29
30: 1005 2012-07-06 22:45:24 1 30
31: 1007 2012-07-08 01:51:51 1 31
32: 1001 2012-08-12 07:04:49 1 32
33: 1006 2012-08-29 04:09:09 1 33
34: 1006 2012-09-25 19:37:58 0 34
35: 1003 2012-10-07 06:20:29 0 35
36: 1002 2012-10-08 19:16:35 0 36
37: 1001 2012-10-11 07:38:40 0 37
38: 1001 2012-10-24 10:58:16 0 38
39: 1005 2012-10-28 16:28:39 0 39
40: 1008 2012-10-30 01:57:52 1 40
41: 1006 2012-11-04 09:14:35 1 41
42: 1007 2012-11-11 10:56:59 0 42
43: 1008 2012-11-13 17:05:58 0 43
44: 1001 2012-11-17 08:38:36 1 44
45: 1005 2012-11-26 02:49:51 1 45
46: 1008 2012-11-26 06:12:53 0 46
47: 1005 2012-11-29 17:34:43 1 47
48: 1001 2012-11-29 23:25:36 0 48
49: 1006 2012-12-14 17:35:57 0 49
50: 1002 2012-12-19 08:36:07 1 50
ID Date Status Index
> testWithVisit
Index ID Date Status Num_Visit Num_Show
1: 1 1002 2012-01-11 03:46:27 1 0 0
2: 2 1001 2012-02-17 10:15:59 1 0 0
3: 3 1002 2012-02-26 13:18:42 1 1 1
4: 4 1001 2012-02-27 18:48:00 1 1 1
5: 5 1004 2012-03-11 05:40:36 1 0 0
6: 6 1004 2012-03-17 06:06:05 0 1 1
7: 7 1008 2012-03-17 14:41:53 0 0 0
8: 8 1008 2012-03-21 13:55:51 1 1 0
9: 9 1008 2012-03-22 22:30:42 0 2 1
10: 10 1005 2012-03-29 09:00:39 1 0 0
11: 11 1005 2012-04-04 02:46:54 1 1 1
12: 12 1004 2012-04-05 22:53:05 1 2 1
13: 13 1006 2012-04-11 19:53:10 0 0 0
14: 14 1007 2012-04-14 17:19:07 1 0 0
15: 15 1003 2012-04-16 08:28:26 1 0 0
16: 16 1007 2012-04-16 19:26:57 1 1 1
17: 17 1001 2012-04-17 15:43:26 1 2 2
18: 18 1008 2012-04-21 07:12:20 0 3 1
19: 19 1004 2012-04-26 06:44:01 0 3 2
20: 20 1001 2012-05-10 13:17:56 1 3 3
21: 21 1005 2012-05-10 18:56:17 1 2 2
22: 22 1008 2012-05-11 08:58:28 1 4 1
23: 23 1001 2012-05-16 08:20:22 1 4 4
24: 24 1003 2012-06-06 04:15:58 1 1 1
25: 25 1006 2012-06-11 12:01:15 1 1 0
26: 26 1008 2012-06-20 14:06:22 1 5 2
27: 27 1002 2012-06-21 05:18:20 1 2 2
28: 28 1008 2012-06-29 16:07:28 0 6 3
29: 29 1002 2012-07-02 09:42:15 1 3 3
30: 30 1005 2012-07-06 22:45:24 1 3 3
31: 31 1007 2012-07-08 01:51:51 1 2 2
32: 32 1001 2012-08-12 07:04:49 1 5 5
33: 33 1006 2012-08-29 04:09:09 1 2 1
34: 34 1006 2012-09-25 19:37:58 0 3 2
35: 35 1003 2012-10-07 06:20:29 0 2 2
36: 36 1002 2012-10-08 19:16:35 0 4 4
37: 37 1001 2012-10-11 07:38:40 0 6 6
38: 38 1001 2012-10-24 10:58:16 0 7 6
39: 39 1005 2012-10-28 16:28:39 0 4 4
40: 40 1008 2012-10-30 01:57:52 1 7 3
41: 41 1006 2012-11-04 09:14:35 1 4 2
42: 42 1007 2012-11-11 10:56:59 0 3 3
43: 43 1008 2012-11-13 17:05:58 0 8 4
44: 44 1001 2012-11-17 08:38:36 1 8 6
45: 45 1005 2012-11-26 02:49:51 1 5 4
46: 46 1008 2012-11-26 06:12:53 0 9 4
47: 47 1005 2012-11-29 17:34:43 1 6 5
48: 48 1001 2012-11-29 23:25:36 0 9 7
49: 49 1006 2012-12-14 17:35:57 0 5 3
50: 50 1002 2012-12-19 08:36:07 1 5 4
Index ID Date Status Num_Visit Num_Show
#Generate test data.
test = data.frame(list(ID = sample(1001:1008, size = 50, replace = TRUE)))
test$Date = as.POSIXct(sample(as.POSIXct("2012-01-01"):as.POSIXct("2012-12-31"), size = 50, replace = FALSE), origin = as.POSIXct("1970-01-01"))
test$Status = sample(0:1, size = nrow(test), replace = TRUE, prob = c(0.4, 0.6))
test = test[order(test$Date), ]
test$Index = c(1:nrow(test))
#Compute Num_Visit and Num_Show
library(data.table)
test = data.table(test)
setkey(test, "Index")
# One group per row (by Index); each group scans the whole table for earlier rows with the same ID.
counts = test[, list(Num_Visit = length(Index[test$Index < Index & test$ID == ID]),
                     Num_Show = length(Index[test$Index < Index & test$ID == ID
                                             & test$Status == 1])),
              by = key(test)]
testWithVisit = test[counts, ]
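One direction that might avoid the full-table scan for every row is to rely on the rows already being in date order and compute running counts within each ID. The following is only a sketch on a small made-up table (`demo` and its values are hypothetical, not taken from my records), and I have not timed it on the full 1 million rows. Within each ID, the number of earlier appointments is the row's position in its group minus one, and the number of earlier shows is the running sum of Status with the current row excluded.

```r
library(data.table)

# Hypothetical demo data, assumed to be sorted by date already (Index = date order).
demo <- data.table(
  Index  = 1:6,
  ID     = c(1002L, 1001L, 1002L, 1001L, 1002L, 1001L),
  Status = c(1L, 0L, 1L, 1L, 0L, 1L)
)

# Earlier appointments per guest: position within the guest's rows, minus one.
demo[, Num_Visit := seq_len(.N) - 1L, by = ID]

# Earlier shows per guest: running total of Status, excluding the current row.
demo[, Num_Show := cumsum(Status) - Status, by = ID]

print(demo)
```

Because `:=` adds the columns by reference and grouping with `by = ID` keeps the original row order, the result should stay in Index order without needing a join back onto the table.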