r - Efficiently applying functions to a long data frame in R

Question

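I have a long data frame that contains meteorological data from a mast. It holds observations (data$value) of different parameters (wind speed, wind direction, air temperature, etc.; data$param) at different heights (data$z).

I am trying to slice this data efficiently by $time and then apply functions to all of the data collected at each time step. Usually a function is applied to one $param at a time (i.e. I apply different functions to wind speed than I do to air temperature).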

Current approach

My current method is to use data.frame and ddply.

If I want to get all of the wind speed data, I run this:

# find good data ----
df <- data[((data$param == "wind speed") &
                  !is.na(data$value)),]

I then run my function on df using ddply():

library(plyr)

df.tav <- ddply(df,
                .(time),
                function(x) {
                  # one row of summary statistics per time step
                  data.frame(V1 = sum(x$value) + sum(x$z),
                             V2 = sum(x$value) / sum(x$z))
                })

Normally V1 and V2 are calls to other functions; these are just examples. I do need to run multiple functions on the same data.

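Problem

My current approach is very slow. I have not benchmarked it, but it is slow enough that I can go get a coffee and come back before a year's worth of data has been processed.

I have on the order of a hundred towers to process, each with a year of data and 10-12 heights, so I am looking for something faster.

Data sample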

data <-  structure(list(time = structure(c(1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0, 
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160, 
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60, 
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature", 
"humidity", "barometric pressure", "wind direction", "turbulence", 
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"temperature", "barometric pressure", "humidity", "wind direction", 
"wind speed", "turbulence", "wind direction"), value = c(-2.5, 
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32, 
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35, 
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9, 
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time", 
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")

score 14 · Accepted Answer

Use data.table:

library(data.table)
dt = data.table(data)

setkey(dt, param)  # sort by param to look it up fast

dt[J('wind speed')][!is.na(value),
                    list(sum(value) + sum(z), sum(value)/sum(z)),
                    by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000
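
The same keyed-subset-then-group pattern can be tried on a tiny table first. The following is a minimal sketch, assuming a made-up table `toy` with hypothetical values (not the mast data above); naming the expressions in `j` gives the result columns explicit names instead of relying on the default V1 and V2.

```r
library(data.table)

# Toy long-format table (hypothetical values, not the mast data above)
toy <- data.table(
  time  = rep(c(1, 2), each = 3),
  z     = c(10, 20, 10, 10, 20, 10),
  param = c("wind speed", "wind speed", "temperature",
            "wind speed", "wind speed", "temperature"),
  value = c(4, 6, -2, 5, NA, -3)
)

setkey(toy, param)  # sort by param so the lookup below is a binary search

# Subset to one parameter, drop missing values, then summarise per time step
res <- toy[.("wind speed")][!is.na(value),
           .(V1 = sum(value) + sum(z), V2 = sum(value) / sum(z)),
           by = time]
print(res)
```

Filtering out the NA rows before grouping means the corresponding `z` values are dropped as well, which matches what the answer's code does.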

If you want to apply a different function to each param, here is a more uniform approach.

# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
fns
#                 p     fn
#1: wind direction <call>    # the fn column holds unevaluated calls
#2:     wind speed <call>    # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#            param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02

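PS: I think the fact that I have to use param in some way before eval for eval to work is a bug.

UPDATE: As of version 1.8.11 this bug has been fixed, and the following works: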

dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
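
If quoted calls and `eval` feel too indirect, a plain named list of ordinary functions can do the same per-parameter dispatch. This is a different technique from the one above, shown only as a sketch: the table `toy`, its values, and the list `fn_by_param` are hypothetical names introduced for illustration.

```r
library(data.table)

# Toy long-format table (hypothetical values)
toy <- data.table(
  time  = c(1, 1, 1, 1, 2, 2),
  z     = c(10, 20, 10, 20, 10, 10),
  param = c("wind speed", "wind speed", "wind direction",
            "wind direction", "wind speed", "wind direction"),
  value = c(4, 6, 250, 260, 5, 240)
)

# Hypothetical lookup: one summary function per parameter
fn_by_param <- list(
  "wind direction" = function(value, z) sum(value) + sum(z),
  "wind speed"     = function(value, z) sum(value) / sum(z)
)

# Within each (param, time) group, param is a single value,
# so it can be used to pick the matching function from the list
res <- toy[!is.na(value),
           .(result = fn_by_param[[param[1]]](value, z)),
           by = .(param, time)]
print(res)
```

A parameter with no entry in `fn_by_param` would make the lookup return NULL and the call fail, so in practice the list should cover every parameter left in the table.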

score 9 · Accepted Answer

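Use dplyr. It is still in development, but it is much, much faster than plyr:

# install.packages("dplyr")  # or devtools::install_github("tidyverse/dplyr") for the development version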
library(dplyr)

windspeed <- subset(data, param == "wind speed")
daily <- group_by(windspeed, time)

summarise(daily, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

Another advantage of dplyr is that you can use a data table as a backend, without having to know anything about data.table's special syntax:

library(data.table)
daily_dt <- group_by(data.table(windspeed), time)
summarise(daily_dt, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

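(dplyr with a data frame is 20-100x faster than plyr, and dplyr with a data.table is about another 10x faster.) dplyr is nowhere near as concise as data.table, but it has a function for each major task of data analysis, which I find makes the code easier to understand: you should almost be able to read a sequence of dplyr operations aloud to someone else and have them understand what is going on.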
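
To illustrate the point about readability, here is a minimal sketch of the wind speed summary written as one chain of verbs. It assumes a current dplyr release, which exports the `%>%` pipe, and uses a made-up data frame `toy` with hypothetical values.

```r
library(dplyr)

# Toy long-format data frame (hypothetical values)
toy <- data.frame(
  time  = c(1, 1, 1, 2, 2),
  z     = c(10, 20, 10, 10, 20),
  param = c("wind speed", "wind speed", "temperature",
            "wind speed", "wind speed"),
  value = c(4, 6, -2, 5, NA)
)

# Read top to bottom: keep valid wind speed rows, group by time, summarise
res <- toy %>%
  filter(param == "wind speed", !is.na(value)) %>%
  group_by(time) %>%
  summarise(V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))
print(res)
```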

If you want to compute different summaries for each variable, I recommend changing your data structure to be "tidy":

library(reshape2)
data_tidy <- dcast(data, ... ~ param)

daily_tidy <- group_by(data_tidy, time)
summarise(daily_tidy, 
  mean.pressure = mean(`barometric pressure`, na.rm = TRUE),
  sd.turbulence = sd(turbulence, na.rm = TRUE)
)
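
It may help to see what the reshaped table looks like. The sketch below assumes a small made-up data frame `toy` (hypothetical values) and spells out the identifier columns rather than using `...`. Parameters that were not measured at a given height come out as NA, which is why the summaries above pass `na.rm = TRUE`.

```r
library(reshape2)

# Toy long-format data frame (hypothetical values)
toy <- data.frame(
  time  = c(1, 1, 1, 1),
  z     = c(0, 0, 10, 10),
  param = c("temperature", "humidity", "wind speed", "turbulence"),
  value = c(-2.5, 41, 4.6, 0.1)
)

# One row per (time, z), one column per parameter
wide <- dcast(toy, time + z ~ param, value.var = "value")
print(wide)
```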
