9

这个问题的答案清楚地解释了如何在通过 dplyr 管道运行回归时按组检索整齐的回归结果,但解决方案不再可重现。

如何结合使用 dplyr 和 broom 来按组运行回归并使用 R 4.02、dplyr 1.0.0 和 broom 0.7.0 检索整齐的结果?

具体来说,上面链接的问题的示例答案,

library(dplyr)
library(broom)

df.h = data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)

当我在我的系统上运行它时返回以下错误(和三个警告):

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. 
2: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA

如果我重新格式化df.h$hour为一个字符而不是因子,

df.h <- df.h %>%
  mutate(
    hour = as.character(hour)
  )

按组重新运行回归,并再次尝试使用检索结果broom::tidy

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)

我收到此错误:

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  is.atomic(x) is not TRUE

我认为问题与组级回归结果作为列表存储在 中的事实有关dfHour$fitHour,但我不确定如何纠正错误并再次整齐快速地编译回归结果,就像在最初发布的代码/答案。

4

3 回答 3

7

****** 更新了从 dplyr 1.0.0 发行说明中提取的更简洁的代码 ******

谢谢你。我在与使用提供的链接中的示例相关的 dplyr 1.0.0 更新中遇到了类似的问题。这是一个有用的问题和答案。

仅供参考,do() 自 dplyr 1.0.0 起已被取代,因此可以考虑使用更新的语言(现在我的更新非常有效):

dfHour = df.h %>% 
  # replace group_by() with nest_by() 
  # to convert your model data to a vector of lists
  nest_by(hour) %>%
  # change do() to mutate(), then add list() before your model
  # make sure to change data = .  to data = data
  mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
  summarise(tidy(mod))

完毕!

这提供了一个非常有效的 df 和选择输出统计信息。最后一行替换了以下代码(来自我的原始响应),它做同样的事情,但不太容易:

ungroup() %>%
  # then leverage the feedback from @akrun
  transmute(hour, HourCoef = map(fitHour, tidy)) %>%
  unnest(HourCoef)

dfHour

这给出了输出:

# A tibble: 72 x 6
   hour  term         estimate std.error statistic  p.value
   <fct> <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 1     (Intercept) 68.6        21.0       3.27   0.00428 
 2 1     wind         0.000558    0.0124    0.0450 0.965   
 3 1     temp        -0.866       0.907    -0.954  0.353   
 4 2     (Intercept) 31.9        17.4       1.83   0.0832  
 5 2     wind         0.00950     0.0113    0.838  0.413   
 6 2     temp         1.69        0.802     2.11   0.0490  
 7 3     (Intercept) 85.5        22.3       3.83   0.00122 
 8 3     wind        -0.0210      0.0165   -1.27   0.220   
 9 3     temp         0.276       1.14      0.243  0.811   
10 4     (Intercept) 73.3        15.1       4.86   0.000126
# ... with 62 more rows

感谢您的耐心等待,我自己正在解决这个问题!

于 2020-08-04T21:34:05.683 回答
5

问题是调用rowwise后有一个分组属性,do并且“fitHour”列是list. 我们可以,将with和itungroup循环到一列listmaptidylist

library(dplyr)
library(purrr)
library(broom)
df.h %>% 
     group_by(hour) %>%
     do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
     ungroup %>% 
     mutate(HourCoef = map(fitHour, tidy))

或使用unnestmtuate

df.h %>% 
      group_by(hour) %>%
      do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
      ungroup %>% 
      transmute(hour, HourCoef = map(fitHour, tidy)) %>% 
      unnest(HourCoef)
# A tibble: 72 x 6
#   hour  term        estimate std.error statistic  p.value
#   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
# 1 1     (Intercept) 89.8       20.2       4.45   0.000308
# 2 1     wind         0.00493    0.0151    0.326  0.748   
# 3 1     temp        -1.84       1.08     -1.71   0.105   
# 4 2     (Intercept) 75.6       23.7       3.20   0.00500 
# 5 2     wind        -0.00910    0.0146   -0.622  0.542   
# 6 2     temp         0.192      0.853     0.225  0.824   
# 7 3     (Intercept) 44.0       23.9       1.84   0.0822  
# 8 3     wind        -0.00158    0.0166   -0.0953 0.925   
# 9 3     temp         0.622      1.19      0.520  0.609   
#10 4     (Intercept) 57.8       18.9       3.06   0.00676 
# … with 62 more rows

如果我们想要一个单一的数据集,pull“fitHour”,遍历listwith map,通过行绑定(后缀_dfr)将其压缩为一个单一的数据集

df.h %>%
    group_by(hour) %>%  
    do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
    ungroup %>% 
    pull(fitHour) %>% 
    map_dfr(tidy, .id = 'grp')

注意:OP 的错误消息可以用R 4.02,dplyr 1.0.0broom 0.7.0

tidy(dfHour,fitHour)

var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) 中的错误:在因子 x 上调用 var(x) 已失效. 使用类似 'all(duplicated(x)[-1L])' 的东西来测试一个常数向量。另外:警告消息:1:不推荐使用数据框整理器,并将在即将发布的 broom 中删除。2:在 mean.default(X[[i]], ...) 中:

于 2020-07-18T19:55:35.853 回答
0

您的代码确实有效。也许包版本或重新开始一个新R会话可能会有所帮助:

library(dplyr)
library(broom)

df.h = data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

tidy(dfHour,fitHour)

# A tibble: 72 x 6
# Groups:   hour [24]
   hour  term         estimate std.error statistic   p.value
   <fct> <chr>           <dbl>     <dbl>     <dbl>     <dbl>
 1 1     (Intercept) 66.4       14.8        4.48   0.000288 
 2 1     wind         0.000474   0.00984    0.0482 0.962    
 3 1     temp         0.0691     0.945      0.0731 0.943    
 4 2     (Intercept) 66.5       20.4        3.26   0.00432  
 5 2     wind        -0.00540    0.0127    -0.426  0.675    
 6 2     temp        -0.306      0.944     -0.324  0.750    
 7 3     (Intercept) 86.5       17.3        5.00   0.0000936
 8 3     wind        -0.0119     0.00960   -1.24   0.232    
 9 3     temp        -1.18       0.928     -1.27   0.221    
10 4     (Intercept) 59.8       17.5        3.42   0.00304  
# ... with 62 more rows
于 2020-07-18T20:05:14.230 回答