r - How can I use ggplot with dtplyr / data.table without converting to a data frame or tibble?

Question

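This is my first attempt at using dtplyr and data.table to speed up some of my existing dplyr code.

Problem: if I use a data.table / dtplyr object, I cannot plot it with ggplot. If I convert the data.table / dtplyr object to a tibble at the end of the pipe chain, just before plotting, it works with ggplot, but it takes more time than using a data.frame / tibble throughout, as shown later in this post.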

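library(tidyverse)
library(dtplyr)
library(data.table)
library(scales)     # comma()
library(tidytext)   # reorder_within(), scale_y_reordered()
library(ggthemes)   # scale_fill_tableau()
library(lubridate)
library(bench)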

My code attempts and timing benchmarks:

Data:

data.frame object

df_ind_stacked_daily <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_ind_stacked_daily.csv")) %>% 
  mutate(Date = ymd(Date))

data.table object

df_ind_stacked_daily2 <- setDT(df_ind_stacked_daily)

Plotting with the data.table / dtplyr object:

 df_ind_stacked_daily2 %>% 
    
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() %>% 
    # as.tibble() %>%
    
            ggplot(aes(x = Daily_cases_counts, 
                       y = reorder_within(State.UnionTerritory, 
                                          by = Daily_cases_counts, within = Date),
                       fill = State.UnionTerritory)) +
            geom_col(show.legend = FALSE) +
            facet_wrap(~Date, scales = "free_y") +
            
            geom_text(aes(label = Daily_cases_counts), size=3, color="white", 
                       # position = "dodge", 
                      hjust = 1.2) + 
            
            # theme_minimal() +
            theme(legend.position = "none") +
            scale_x_continuous(labels = comma) + # unit_format(scale = 1e-3, unit = "k")
            scale_fill_tableau(palette = "Tableau 20") +
            scale_y_reordered() +
            coord_cartesian(clip = "off")

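Error: data must be a data frame, or other object coercible by fortify(), not an S3 object with class dtplyr_step_group/dtplyr_step.

PS - If I uncomment as.tibble() in the code block above, ggplot works.

Timing benchmarks for the code:

data.table / dtplyr object without converting to a tibble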

library(bench)

bench::mark(
  df_ind_stacked_daily2 %>% 
    
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() 
    # as.tibble() %>%
)

expression       min    median itr/sec
<S3: bench_expr> 2.45ms 2.75ms 320.3396

data.table / dtplyr object converted to a tibble

library(bench)

bench::mark(
  df_ind_stacked_daily2 %>% 
    
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() %>%
    as_tibble()
)

expression       min    median itr/sec
<S3: bench_expr> 12.7ms 14ms   65.41098

data.frame or tibble object

library(bench)

bench::mark(
  df_ind_stacked_daily %>% 
    
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup()
)

expression       min    median itr/sec
<S3: bench_expr> 6.71ms 7.97ms   120.3636

Question: so how can I make ggplot work with data.table / dtplyr without converting to a data.frame / tibble?

                               ############################

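(Update: in response to the answer)

Thanks @teunbrand. I mostly used the code from your answer, added one more function, and ran it in 3 scenarios:

I created two functions: (1) one that does the processing without coercing to a tibble, and (2) one that coerces to a tibble after the processing.

I ran these in 3 scenarios in total: (1) data.table, (2) data.table converted to a tibble after processing, and (3) tibble from the start.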

# 1. function doesn't convert to tibble 
fun <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() #%>%
    # as_tibble() # Always coerce to tibble
}

# 2. function convert it to tibble after all processing
fun_to_tbl <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() %>%
    as_tibble() # Always coerce to tibble
}


# Make data larger
dt  <- do.call(rbind, rep(list(as.data.table(df_ind_stacked_daily)), 20))
tbl_df <- do.call(rbind, rep(list(as_tibble(df_ind_stacked_daily)), 20))

# Run data.table on single thread
setDTthreads(1)

For some unknown reason my benchmarks would not run together in a single call, so I had to run them one at a time.

(bm <- bench::mark(
  dt_res = fun(dt), # bench dt
  min_iterations = 20
))

expression       min    median itr/sec    mem_alloc
<S3: bench_expr> 4.35ms 6.05ms   148.1923 5.12KB

(bm <- bench::mark(
  dt_to_tbl_res = fun_to_tbl(dt), # bench dt converted to tibble at end
  min_iterations = 20
))

expression       min    median itr/sec    mem_alloc
<S3: bench_expr> 65.8ms 72.2ms   12.28566 47.6MB

(bm <- bench::mark(
  tbl_res =  fun(tbl_df),   # bench tbl
  min_iterations = 20
))

expression       min    median itr/sec  mem_alloc
<S3: bench_expr> 55ms 67.8ms   13.70603 47.4MB
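
One possible reason the expressions could not be benchmarked in a single call (an assumption, not verified against this dataset): by default bench::mark() checks that every expression returns an equal result, and an unmaterialised dtplyr object is not equal to a tibble. A minimal sketch on hypothetical toy data, switching that check off with check = FALSE:

```r
library(bench)
library(data.table)
library(dtplyr)
library(dplyr)

# Toy data, hypothetical (not the COVID dataset used above)
df <- data.frame(x = c(3, 1, 2))
dt <- as.data.table(df)

bm <- bench::mark(
  lazy   = dt %>% lazy_dt() %>% filter(x > 1),    # returns a lazy dtplyr_step
  tibble = df %>% filter(x > 1) %>% as_tibble(),  # returns a tibble
  check = FALSE,       # do not require the two results to be equal
  min_iterations = 5
)
bm
```

With check left at its default of TRUE, expressions that return different objects are rejected, so disabling it only makes sense when the difference in return type is intended.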

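Goal: my main goal is to incorporate this into a Shiny app with dynamic variable selection, so I was hoping to optimise it with data.table. But I guess ggplot cannot work with S3 objects / data.table.

The only time difference I get is when I use a data.table and pass it through as a data.table; otherwise there is no benefit.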

score 2 · Accepted Answer

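There are a few things to note here:

As I understand dtplyr, it accumulates unevaluated operations along your pipe chain; they are only translated from dplyr to data.table syntax. Your computer does not run these operations until you realise the pipeline as a data.frame, data.table or tibble. As a result, your first benchmark underestimates the run time, because it only measures building the query.
Because you use setDT to convert the data.frame to a data.table, what you benchmark as a data.frame is not actually a data.frame. If you read the documentation at ?setDT, you will see that the object is converted in place, without a copy, which means your df_ind_stacked_daily is also a data.table.
The data.table package uses multiple threads by default. We should prevent this for a fair comparison.
Your first filter operation goes from medium data (75748 rows) to small data (252 rows). For most of your pipeline you are not working with much data, and large data is where data.table shines.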
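
The first two points can be checked on a small example. This is only a sketch on hypothetical toy data (the names df, lazy and res are made up for illustration), not the dataset from the question:

```r
library(data.table)
library(dtplyr)
library(dplyr)

# Toy data, hypothetical
df <- data.frame(g = c("a", "a", "b"), x = c(1, 5, 3))

# A dtplyr pipeline only records the translated data.table call
lazy <- lazy_dt(df) %>%
  filter(x > 1) %>%
  group_by(g) %>%
  summarise(x = max(x))

inherits(lazy, "dtplyr_step")  # a lazy step object, nothing computed yet
show_query(lazy)               # prints the data.table expression to be run

# The work is done only when the result is materialised
res <- as.data.table(lazy)

# setDT() converts by reference: the original name is now a data.table too
is.data.table(df)   # not yet, lazy_dt() worked on its own copy
df2 <- setDT(df)
is.data.table(df)   # now a data.table, although df itself was never reassigned
```

Timing only the construction of lazy therefore measures query building, not the filtering and slicing itself.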

After adjusting for some of these, I find no difference in speed.

library(tidyverse)
library(dtplyr)
library(data.table)
library(lubridate)
library(bench)

df <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_ind_stacked_daily.csv")) %>% 
  mutate(Date = ymd(Date))

fun <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() %>%
    as_tibble() # Always coerce to tibble
}

# Make data larger
dt  <- do.call(rbind, rep(list(as.data.table(df)), 20))
tbl <- do.call(rbind, rep(list(as_tibble(df)), 20))

# Run data.table on single thread
setDTthreads(1)

# Benchmark simultaneously
(bm <- bench::mark(
  dt = fun(dt),
  tbl = fun(tbl),
  min_iterations = 20
))
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt           41.1ms   42.5ms      23.4    72.2MB     35.2
#> 2 tbl          40.7ms   41.5ms      24.0      71MB     36.0
plot(bm)

Created on 2021-08-19 by the reprex package (v1.0.0)
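
As for plotting without a tibble: the lazy dtplyr object has to be materialised at some point, but it can be materialised as a data.table with as.data.table(). A data.table inherits from data.frame, so ggplot accepts it directly. A minimal sketch on hypothetical toy data (the columns state and cases are made up):

```r
library(data.table)
library(dtplyr)
library(dplyr)
library(ggplot2)

# Toy data, hypothetical
dt <- data.table(state = c("A", "B", "C", "D"),
                 cases = c(10, 40, 25, 5))

# Build the query lazily, then collect it as a data.table (no tibble involved)
top <- dt %>%
  lazy_dt() %>%
  slice_max(order_by = cases, n = 3) %>%
  as.data.table()

# ggplot works because a data.table is also a data.frame
p <- ggplot(top, aes(x = cases, y = reorder(state, cases))) +
  geom_col()
p
```

Whether this is any faster than as_tibble() on the real data is not something this sketch shows; the benchmark above suggests the materialisation step dominates either way.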
