1

我有 10 年内 10 个地点的每日降雨量数据

set.seed(123)

df <- data.frame(loc.id = rep(1:10, each = 10*365),years = rep(rep(2001:2010,each = 365),times = 10),
             day = rep(rep(1:365,times = 10),times = 10), rain = runif(min = 0 , max = 35, 10*10*365))

我有一个单独的数据框,它有特定的日子,我想用它来总结降雨量df

df.ref <- data.frame(loc.id = rep(1:10, each = 10), 
                 years = rep(2001:2010,times = 10),
                 index1 = rep(250,times = 10*10),
                 index2 = sample(260:270, size = 10*10,replace = T),
                 index3 = sample(280:290, size = 10*10,replace = T),
                 index4 = sample(291:300, size= 10*10,replace = T))

df.ref

    loc.id years index1 index2 index3 index4
1:      1  2001    250    264    280    296
2:      1  2002    250    269    284    298
3:      1  2003    250    268    289    293
4:      1  2004    250    266    281    295
5:      1  2005    250    260    289    293

我想要的是 in 中的行df.ref,使用 in 中的indexdf.ref并将dfindex1 到 index2、index1 到 index3 和 index1 到 index4 之间的降雨量相加。例如:

使用df.ref,对于 loc.id = 1 和 year == 2001,将df250 到 264、250 到 280、250 到 296 的降雨量相加(如图所示df.ref) 同样,对于 2002 年,对于 loc.id = 1,求和降雨量从 250 到 269、250 到 284、250 到 298。

我这样做了:

library(dplyr)  

ptm <- proc.time()

dat <- df.ref  %>% left_join(df)

index1.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index2) %>% summarise(sum.rain1  = sum(rain))

index2.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index3) %>% summarise(sum.rain2 = sum(rain))

index3.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index4) %>% summarise(sum.rain3 = sum(rain))

all.index <- index1.cal %>% left_join(index2.cal) %>% left_join(index3.cal))

 proc.time() - ptm

user  system elapsed 
2.36    0.64    3.06 

我希望使我的代码更快,因为我的实际代码df.ref很大。谁能告诉我如何让这个更快。

4

2 回答 2

3

包中的非 equi 连接data.table可以比dplyr::left_join幻灯片|视频) 更快且内存效率更高

对于中的每个值df,找出介于和之间的所有rain值。然后根据和计算总和。df.refdayindex 1index 2rainloc.idyears

df1 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index2)
                , by = .EACHI][
                  ][
                  , c("sum_1") := .(sum(rain)), by = .(loc.id, years)][
                  # remove all redundant columns
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

df2 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index3)
                , by = .EACHI][
                  ][
                  , c("sum_2") := .(sum(rain)), by = .(loc.id, years)][
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

df3 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index4)
                , by = .EACHI][
                  ][
                  , c("sum_3") := .(sum(rain)), by = .(loc.id, years)][
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

将所有三个 data.table 合并在一起

df1[df2, on = .(loc.id, years)][
  df3, on = .(loc.id, years)]

     loc.id years     sum_1    sum_2    sum_3
  1:      1  1950 104159.11 222345.4 271587.1
  2:      1  1951 118689.90 257450.2 347624.3
  3:      1  1952  99262.27 212923.7 280877.6
  4:      1  1953  72435.50 192072.7 251593.6
  5:      1  1954 104021.19 242525.3 326463.4
  6:      1  1955  93436.32 232653.1 304921.4
  7:      1  1956  89122.79 190424.4 255535.0
  8:      1  1957 135658.11 262918.7 346361.4
  9:      1  1958  80064.18 220454.8 292966.4
 10:      1  1959 114231.19 273181.0 349489.2
 11:      2  1950  94360.69 238296.8 301751.8
 12:      2  1951  93845.50 195273.7 289686.0
 13:      2  1952 107692.53 245019.4 308093.7
 14:      2  1953  86650.14 257225.1 332674.1
 15:      2  1954 104085.83 238859.4 286350.7
 16:      2  1955 101602.16 223107.3 300958.4
 17:      2  1956  73912.77 198087.2 276590.1
 18:      2  1957 117780.86 228299.8 305348.5
 19:      2  1958  98625.45 220902.6 291583.7
 20:      2  1959 109851.38 266745.2 324246.8
 [ reached getOption("max.print") -- omitted 81 rows ]

比较处理时间和使用的内存

> time_dplyr; time_datatable
   user  system elapsed 
   2.17    0.27    2.61 
   user  system elapsed 
   0.45    0.00    0.69 

  rowname      Class  MB
1     dat data.frame 508
2     df3 data.table  26
3     df2 data.table  20
4     df1 data.table   9

在测试大约 100 年的数据时,dplyr使用了超过 50 GB 的内存,而data.table仅消耗了 5 GB。dplyr完成的时间也延长了大约 4 倍。

'data.frame':   3650000 obs. of  4 variables:
 $ loc.id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ years : int  1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
 $ day   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ rain  : num  10.1 27.6 14.3 30.9 32.9 ...
'data.frame':   3650000 obs. of  6 variables:
 $ loc.id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ years : int  1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 ...
 $ index1: num  250 250 250 250 250 250 250 250 250 250 ...
 $ index2: int  270 265 262 267 266 265 262 268 260 268 ...
 $ index3: int  290 287 286 289 281 285 286 285 284 283 ...
 $ index4: int  298 297 296 295 298 294 296 298 298 300 ...

> time_dplyr; time_datatable
   user  system elapsed
 95.010  33.704 128.722
   user  system elapsed
 26.175   3.147  29.312

  rowname      Class    MB
1     dat data.frame 50821
2     df3 data.table  2588
3     df2 data.table  2004
4     df1 data.table   888
5  df.ref data.table    97
6      df data.table    70

如果我将年数增加到 150,dplyr则在具有 256 GB RAM 的 HPC 集群节点上实现收支平衡

Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) :
  negative length vectors are not allowed
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted
于 2018-03-31T02:36:20.170 回答
2

这是一个更快的起点。弄清楚其余的应该是微不足道的。

library(data.table)
setDT(df)

df[df.ref, on = .(loc.id, years, day >= index1, day <= index2), sum(rain), by = .EACHI]
于 2018-03-29T15:58:44.073 回答