r - Sorting data into ranges; averaging; ignoring outliers

Question

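I'm analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I'm just looking for some pointers.

The data consists of two columns, WindSpeed and Power. So far I have imported the data from a CSV file and scatter-plotted the two against each other.

What I would like to do next is sort the data into ranges, for example all data where WindSpeed is between x and y, then find the average power generated in each range and plot the resulting curve.

From this average I want to recalculate the average using only the data that falls within one or two standard deviations of it (basically ignoring outliers).

Any pointers are appreciated.

For those who are interested, I am trying to create a graph similar to this. It's a pretty standard type of graph, but as I said, the sheer quantity of data requires something heavier than Excel.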

score 5 · Accepted Answer

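Since you're no longer in Excel, why not use a modern statistical method that requires neither crude binning of the data nor ad hoc outlier removal: locally smooth regression, as implemented by loess.

Using a slight modification of csgillespie's sample data: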

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

plot(w_sp, power)

x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
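
If the real data contain gross outliers, one option worth trying is the robust variant of the same fit. The sketch below is only an illustration under stated assumptions: the 30 contaminated readings and the `+ 3` shift are invented for the demonstration, and the names `bad`, `fit_ls`, `fit_rob`, `pred_ls` and `pred_rob` are not from the answer. It relies on the `family = "symmetric"` argument of `loess`, which downweights points with large residuals instead of fitting them by least squares.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Same simulated power curve as above, plus a few gross outliers (assumed, for illustration)
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(w_sp - 40) / 5)) + rnorm(1000, sd = 0.1)
bad   <- sample(1000, 30)
power[bad] <- power[bad] + 3   # artificially contaminated readings

# Stay inside the observed wind-speed range; loess does not extrapolate
x_grid <- seq(1, 99, length.out = 100)

# Default loess: local least-squares fit, pulled towards the outliers
fit_ls  <- loess(power ~ w_sp)
# Robust loess: points with large residuals get little or no weight
fit_rob <- loess(power ~ w_sp, family = "symmetric")

pred_ls  <- predict(fit_ls,  data.frame(w_sp = x_grid))
pred_rob <- predict(fit_rob, data.frame(w_sp = x_grid))

plot(w_sp, power, col = "grey")
lines(x_grid, pred_ls,  col = "red",  lwd = 3)   # ordinary fit
lines(x_grid, pred_rob, col = "blue", lwd = 3)   # robust fit
```

On data like this the blue curve should sit closer to the bulk of the points than the red one, but how much it helps on real turbine data depends on how the outliers are distributed.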

score 2 · Accepted Answer

First, we'll create some example data to make the problem concrete:

w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))

Suppose we want to bin the values of power by wind speed into the intervals [0,5), [5,10), and so on. Then

bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))

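We have now computed the mean power within each range of interest. Note that if you want the median instead, just change mean to median. All that is left to do is plot them: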

plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
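
The same binning can also be written with `cut()` and `tapply()`, which some readers may find easier to extend (for example to per-bin counts). This is a sketch rather than part of the original answer; the names `breaks`, `bin`, `y_mean_cut`, `n_per_bin` and `mids` are introduced here for illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Example data generated the same way as above
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(rnorm(1000, mean = w_sp, sd = 5) - 40) / 5))

bin_incr <- 5
breaks   <- seq(0, 100, bin_incr)

# right = FALSE gives half-open bins [0,5), [5,10), ..., [95,100)
bin <- cut(w_sp, breaks = breaks, right = FALSE)

# One mean per bin: tapply() groups power by the bin factor
y_mean_cut <- tapply(power, bin, mean)

# Number of observations per bin, useful for spotting sparse bins
n_per_bin <- table(bin)

# Bin midpoints, for plotting the means against wind speed
mids <- head(breaks, -1) + bin_incr / 2
```

Plotting `mids` against `y_mean_cut` with `points()` then gives the same picture as the `sapply()` version.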

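To get the mean based on the data that falls within two standard deviations of the mean, we need to create a slightly more complicated function: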

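noOutliers = function(x, power, w_sp, bin_incr) {
  # power readings whose wind speed falls in [x, x + bin_incr)
  d = power[w_sp >= x & w_sp < (x + bin_incr)]
  m_d = mean(d)
  s_d = sd(d)
  # keep only readings within two standard deviations of the bin mean
  d_trim = d[d > (m_d - 2*s_d) & d < (m_d + 2*s_d)]
  return(mean(d_trim))
}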

y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
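
To see what the two-standard-deviation filter does inside a single bin, here is a small worked example on made-up numbers. The helper `trimmed_mean` and the vector `d` are hypothetical and only mirror the logic of the function above.

```r
# Hypothetical helper: mean of the values within k standard deviations of the mean
trimmed_mean <- function(d, k = 2) {
  m <- mean(d)
  s <- sd(d)
  keep <- abs(d - m) < k * s   # TRUE for values inside the band
  mean(d[keep])
}

# Nine plausible readings and one obvious outlier (made-up values)
d <- c(10, 11, 9, 10, 12, 8, 10, 11, 9, 50)

mean(d)          # 14: pulled upwards by the outlier
trimmed_mean(d)  # 10: the value 50 lies more than 2 sd from the mean and is dropped
```

Note that the outlier also inflates the standard deviation used for the cut-off, so a single pass can miss milder outliers, and a bin with fewer than two observations returns NA because `sd()` is undefined there.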

score 2 · Accepted Answer

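Throwing this version into the mix: it is similar in motivation to @hadley's, but uses an additive model with an adaptive smoother from the mgcv package:

Dummy data first, as used by @hadley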

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)

Fit the additive model with gam(), using an adaptive smoother and smoothness selection via REML

require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)

Predict from the model and get standard errors of the fit, then use the latter to generate an approximate 95% confidence interval

x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)

Plot everything, along with the loess fit for comparison

plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
      lwd = 3)

[Figure: adaptive smoother and loess fits]
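
For a like-for-like comparison, a similar approximate band can be drawn around the loess fit as well. The sketch below is not part of the original answer; it assumes the same simulated data and uses the `se = TRUE` argument of `predict()` for loess objects, with `lo`, `newd` and `lo_pred` as illustrative names.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Same dummy data as above
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(w_sp - 40) / 5)) + rnorm(1000, sd = 0.1)
df    <- data.frame(power = power, w_sp = w_sp)

# Loess fit and a prediction grid spanning the observed wind speeds
lo   <- loess(power ~ w_sp, data = df)
newd <- data.frame(w_sp = seq(min(df$w_sp), max(df$w_sp), length.out = 100))

# se = TRUE returns a list with the fitted values and their standard errors
lo_pred <- predict(lo, newd, se = TRUE)

# Approximate pointwise 95% band: fit +/- 2 standard errors
newd$fit <- lo_pred$fit
newd$upr <- lo_pred$fit + 2 * lo_pred$se.fit
newd$lwr <- lo_pred$fit - 2 * lo_pred$se.fit

plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = newd, col = "blue", lwd = 3)
lines(upr ~ w_sp, data = newd, col = "blue", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = newd, col = "blue", lwd = 2, lty = "dashed")
```

Both bands are pointwise and approximate, so they describe uncertainty in the fitted curve, not the spread of individual readings.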

score 1 · Accepted Answer

Here are some examples of fitted curves (Weibull analysis) for commercial turbines:

http://www.inl.gov/wind/software/

http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf

http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf

score 0 · Accepted Answer

I'd also recommend trying Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/.

    # If you haven't already installed ggplot2:
    install.packages("ggplot2", dependencies = TRUE)

    # Load the ggplot2 package
    require(ggplot2)

    # csgillespie's example data
    w_sp <- sample(seq(0, 100, 0.01), 1000)
    power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

    # Bind the two variables into a data frame, which ggplot prefers
    wind <- data.frame(w_sp = w_sp, power = power)

    # Take a look at how the first few rows look, just for fun
    head(wind)


    # Create a simple plot
    ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()

    # Create a slightly more complicated plot as an example of how to fine tune
    # plots in ggplot
    p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
    p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 16)  # shape 16: filled circle
    p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
    p3 + scale_x_continuous(name = "mph") + 
             scale_y_continuous(name = "power") +
             ggtitle("Wind speed and power")  # opts() has been removed from ggplot2
