1

我正在尝试从一个表中选择行(“位置”),其中特定列(“位置”)的值落在另一个(“my_ranges”)表中定义的范围内,然后从“我的范围”表。

我可以使用 tibbles 和几个purrr::map2调用来做到这一点,但同样的方法不适用于 dbplyr database-tibbles。这是预期的行为吗?如果是这样,我应该采取不同的方法来使用 dbplyr 来完成此类任务吗?

这是我的例子:

library("tidyverse")
set.seed(42)

my_ranges <-
  tibble(
    group_id = c("a", "b", "c", "d"),
    start = c(1, 7, 2, 25),
    end = c(5, 23, 7, 29)
    )

positions <-
  tibble(
    position = as.integer(runif(n = 100, min = 0, max = 30)),
    annotation = stringi::stri_rand_strings(n = 100, length = 10)
  )

# note: this works as I expect and returns a tibble with 106 obs of 3 variables:
result <- map2(.x = my_ranges$start, .y = my_ranges$end,
             .f = function(x, y) {between(positions$position, x, y)}) %>%
  map2(.y = my_ranges$group_id,
              .f = function(x, y){
                positions %>%
                  filter(x) %>%
                  mutate(group_id = y)}
) %>% bind_rows()

# next, make an in-memory db for testing:
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")

# copy data to db
copy_to(con, my_ranges, "my_ranges", temporary = FALSE)
copy_to(con, positions, "positions", temporary = FALSE)

# get db-backed tibbles:
my_ranges_db <- tbl(con, "my_ranges")
positions_db <- tbl(con, "positions")

# note: this does not work as I expect, and instead returns a tibble with 0 obsevations of 0 variables:
# database range-based query:
db_result <- map2(.x = my_ranges_db$start, .y = my_ranges_db$end,
                  .f = function(x, y) {
                    between(positions_db$position, x, y)
                    }) %>%
  map2(.y = my_ranges_db$group_id,
       .f = function(x, y){
         positions_db %>%
           filter(x) %>%
           mutate(group_id = y)}
  ) %>% bind_rows()
4

2 回答 2

4

只要每次迭代创建一个相同维度的表,那么可能有一种巧妙的方式将整个操作推送到数据库。这个想法是同时使用map()reduce()from purrr。每个tbl_sql()操作都是惰性的,因此我们可以遍历它们而不用担心发送一堆查询,然后我们可以使用union()它基本上将每次迭代的结果 SQL 附加到下一个使用UNION给定数据库中的子句。这是一个例子:

library(dbplyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)
library(DBI, warn.conflicts = FALSE)
library(rlang, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")

db_mtcars <- copy_to(con, mtcars)

cyls <- c(4, 6, 8)

all <- cyls %>%
  map(~{
    db_mtcars %>%
      filter(cyl == .x) %>%
      summarise(mpg = mean(mpg, na.rm = TRUE)
      )
  }) %>%
  reduce(function(x, y) union(x, y)) 

all
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.22.0 []
#>     mpg
#>   <dbl>
#> 1  15.1
#> 2  19.7
#> 3  26.7

show_query(all)
#> <SQL>
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 4.0))
#> UNION
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 6.0))
#> UNION
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 8.0))

dbDisconnect(con)
于 2018-09-22T21:52:17.423 回答
3

dbplyr translates R into SQL. Lists don't exist in SQL. map creates lists. Thus it's impossible to translate map into SQL.

Mainly dplyr functions and some base functions are translated, they're working on tidyr functions too as I understood. When using dbplyr try to have an SQL logic in your approach or it will easily break.

于 2018-03-27T20:22:15.657 回答