r - dplyr summarise() with multiple return values from a single function

Question

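Is there a way to use summarise() (dplyr 0.1.2) with a function that returns multiple values, for instance the describe function from the psych package?

If not, is that only because it has not been implemented yet, or is there a reason it would not be a good idea?

Example: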

require(psych)
require(ggplot2)
require(dplyr)

dgrp <- group_by(diamonds, cut)
describe(dgrp$price)
summarise(dgrp, describe(price))

This produces: Error: expecting a single value

score 45 · Accepted Answer

With dplyr >= 0.2 we can use the do function:

library(ggplot2)
library(psych)
library(dplyr)
diamonds %>%
    group_by(cut) %>%
    do(describe(.$price)) %>%
    select(-vars)
#> Source: local data frame [5 x 13]
#> Groups: cut [5]
#> 
#>         cut     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
#>      (fctr) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
#> 1      Fair  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
#> 2      Good  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
#> 3 Very Good 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
#> 4   Premium 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
#> 5     Ideal 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233
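In dplyr 1.0.0 and later, do() is marked as superseded, and group_modify() covers the same "one data frame per group" pattern. The following is only a sketch under that assumption: price_summary is a hypothetical helper introduced here for illustration (it is not part of the answer and stands in for describe), and it returns a one-row data frame for each group.

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds data set

# Hypothetical helper (an assumption for this sketch): returns a one-row
# data frame holding several summary statistics of a numeric vector.
price_summary <- function(x) {
  data.frame(
    n      = length(x),
    mean   = mean(x),
    sd     = sd(x),
    median = as.numeric(median(x))  # keep the column type identical across groups
  )
}

# group_modify() calls the function once per group and row-binds the results,
# adding the grouping column back in front.
res <- diamonds %>%
  group_by(cut) %>%
  group_modify(~ price_summary(.x$price))

res
```

The result should have one row per level of cut, with the columns cut, n, mean, sd and median.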

A solution based on the purrr package (these functions live in purrrlyr since 2017):

library(ggplot2)
library(psych)
library(purrr)     # provides %>%
library(purrrlyr)  # slice_rows() and by_slice() moved here from purrr in 2017
diamonds %>% 
    slice_rows("cut") %>% 
    by_slice(~ describe(.x$price), .collate = "rows")
#> Source: local data frame [5 x 14]
#> 
#>         cut  vars     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
#>      (fctr) (dbl) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
#> 1      Fair     1  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
#> 2      Good     1  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
#> 3 Very Good     1 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
#> 4   Premium     1 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
#> 5     Ideal     1 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233

But it is just this simple with data.table:

library(data.table)
as.data.table(diamonds)[, describe(price), by = cut]
#>          cut vars     n     mean       sd median  trimmed      mad min   max range     skew kurtosis       se
#> 1:     Ideal    1 21551 3457.542 3808.401 1810.0 2656.136 1630.860 326 18806 18480 1.835587 2.977425 25.94233
#> 2:   Premium    1 13791 4584.258 4349.205 3185.0 3822.231 3371.432 326 18823 18497 1.333358 1.072295 37.03497
#> 3:      Good    1  4906 3928.864 3681.590 3050.5 3251.506 2853.264 327 18788 18461 1.721943 3.042550 52.56197
#> 4: Very Good    1 12082 3981.760 3935.862 2648.0 3243.217 2855.488 336 18818 18482 1.595341 2.235873 35.80721
#> 5:      Fair    1  1610 4358.758 3560.387 3282.0 3695.648 2183.128 337 18574 18237 1.780213 3.067175 88.73281

We can also write our own summary function that returns a list:

fun <- function(x) {
    list(n = length(x),
         min = min(x),
         median = as.numeric(median(x)),
         mean = mean(x),
         sd = sd(x),
         max = max(x))
}
as.data.table(diamonds)[, fun(price), by = cut]
#>          cut     n min median     mean       sd   max
#> 1:     Ideal 21551 326 1810.0 3457.542 3808.401 18806
#> 2:   Premium 13791 326 3185.0 4584.258 4349.205 18823
#> 3:      Good  4906 327 3050.5 3928.864 3681.590 18788
#> 4: Very Good 12082 336 2648.0 3981.760 3935.862 18818
#> 5:      Fair  1610 337 3282.0 4358.758 3560.387 18574
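For comparison, the same list-returning function can probably be reused from dplyr as well. This is a sketch that assumes dplyr 1.0.0 or later, where an unnamed data frame result inside summarise() is spread into separate columns; wrapping the list in as_tibble() is my own addition here, not part of the answer.

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds data set

# Same summary function as above: returns a named list of length-one values.
fun <- function(x) {
  list(n = length(x),
       min = min(x),
       median = as.numeric(median(x)),
       mean = mean(x),
       sd = sd(x),
       max = max(x))
}

# as_tibble() turns the named list into a one-row tibble; because the
# expression is unnamed, summarise() creates one output column per element.
res <- diamonds %>%
  group_by(cut) %>%
  summarise(as_tibble(fun(price)))

res
```

The columns come out in the order the list defines them, after the grouping column cut.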

score 8 · Accepted Answer

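This is possible in recent versions of the tidyverse (dplyr 1.0.0 or later).

First, in the example you give, the function returns a one-row data frame. If we use such a function inside summarize(), the result is a data-frame column, which we can turn into separate columns with unpack().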

library(tidyverse)
library(psych)

describe(diamonds$price)
#>    vars     n   mean      sd median trimmed     mad min   max range skew
#> X1    1 53940 3932.8 3989.44   2401 3158.99 2475.94 326 18823 18497 1.62
#>    kurtosis    se
#> X1     2.18 17.18

diamonds %>%
  group_by(cut) %>%
  summarize(descr = describe(price)) %>%
  unpack(cols = descr)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 14
#>   cut    vars     n  mean    sd median trimmed   mad   min   max range  skew
#>   <ord> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair      1  1610 4359. 3560.  3282    3696. 2183.   337 18574 18237  1.78
#> 2 Good      1  4906 3929. 3682.  3050.   3252. 2853.   327 18788 18461  1.72
#> 3 Very…     1 12082 3982. 3936.  2648    3243. 2855.   336 18818 18482  1.60
#> 4 Prem…     1 13791 4584. 4349.  3185    3822. 3371.   326 18823 18497  1.33
#> 5 Ideal     1 21551 3458. 3808.  1810    2656. 1631.   326 18806 18480  1.84
#> # … with 2 more variables: kurtosis <dbl>, se <dbl>
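To make the term "data-frame column" more concrete, here is a small sketch that inspects the intermediate result before and after unpack(). It assumes dplyr 1.0.0 or later and uses a plain tibble() of three statistics in place of describe(), so that it does not depend on psych; the names packed and unpacked are made up for this illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # only needed for the diamonds data set

# A named data frame result is stored as a single column that is itself
# a data frame (a "packed" column).
packed <- diamonds %>%
  group_by(cut) %>%
  summarize(descr = tibble(n = length(price),
                           mean = mean(price),
                           sd = sd(price)))

names(packed)                # "cut" "descr"
is.data.frame(packed$descr)  # TRUE
names(packed$descr)          # "n" "mean" "sd"

# unpack() lifts the inner columns up to the top level.
unpacked <- unpack(packed, cols = descr)
names(unpacked)              # "cut" "n" "mean" "sd"
```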

Second, in some cases a function simply returns a vector as output. In these cases, summarize() generates one new row for each value returned.

set.seed(1234)
dsmall <- diamonds[sample(nrow(diamonds), 25), ]

unique(dsmall$clarity)
#> [1] I1   SI2  VVS2 VS1  VVS1 VS2  SI1  IF  
#> Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF

dsmall %>%
  group_by(cut) %>%
  summarize(clarity = unique(clarity))
#> `summarise()` regrouping output by 'cut' (override with `.groups` argument)
#> # A tibble: 17 x 2
#> # Groups:   cut [4]
#>    cut       clarity
#>    <ord>     <ord>  
#>  1 Good      I1     
#>  2 Good      SI2    
#>  3 Good      VS1    
#>  4 Good      SI1    
#>  5 Very Good VVS2   
#>  6 Very Good SI2    
#>  7 Very Good VS1    
#>  8 Very Good IF     
#>  9 Premium   SI2    
#> 10 Premium   SI1    
#> 11 Ideal     VS1    
#> 12 Ideal     VVS1   
#> 13 Ideal     VS2    
#> 14 Ideal     VVS2   
#> 15 Ideal     SI1    
#> 16 Ideal     SI2    
#> 17 Ideal     IF

Created on 2020-07-14 by the reprex package (v0.3.0)
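A note for newer installations: from dplyr 1.1.0 on, returning more than one row per group from summarize() is deprecated and gives a warning, and reframe() is the documented replacement. The sketch below repeats the clarity example under that assumption. Unlike summarize(), reframe() always returns an ungrouped result.

```r
library(dplyr)    # reframe() requires dplyr >= 1.1.0
library(ggplot2)  # only needed for the diamonds data set

set.seed(1234)
dsmall <- diamonds[sample(nrow(diamonds), 25), ]

# reframe() allows any number of rows per group, here one row per
# distinct clarity value within each cut.
res <- dsmall %>%
  group_by(cut) %>%
  reframe(clarity = unique(clarity))

res
```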
