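If possible, I'd like to use bigrquery with dplyr syntax (rather than SQL) to explore Google Analytics 360 data. The gist is that I want to understand user journeys: I'm interested in finding the most common page sequences at the user level (even across sessions).

I thought I could do it like this: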

sample_query <- ga_sample %>%
  select(fullVisitorId, date, visitStartTime, totals, channelGrouping,
  hits.page.pagePath) %>% 
  collect()

But I get an error saying hits.page.pagePath was not found. Then I tried:

sample_query <- ga_sample %>%
  select(fullVisitorId, date, visitStartTime, totals, channelGrouping, hits) %>% 
  collect() %>% 
  unnest_wider(hits)

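But the result was Error: Requested Resource Too Large to Return [responseTooLarge], which makes complete sense.

From what I've gathered, with SQL syntax the workaround is to unnest remotely and select only the hits.page.pagePath field (rather than the entire top-level hits field).

For example, something like this (it's a different query, but it conveys the point):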

SELECT
  hits.page.pagePath
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20160801` AS GA,
  UNNEST(GA.hits) AS hits
GROUP BY
  hits.page.pagePath

Is it possible to do something similar with dplyr syntax? If not, what is the best approach using SQL?

Thanks!

Update: the actual query/code

SELECT DISTINCT
fullVisitorId, visitId, date, visitStartTime, hits.page.pagePath, hits.time, geoNetwork.networkDomain
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE _TABLE_SUFFIX BETWEEN "20191101" AND "20191102"
AND geoNetwork.networkDomain NOT LIKE "%google%"

2 Answers

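The kinds of queries dbplyr can create when translating from R to BigQuery (or whichever database language you use) depend on the translations defined between R and BigQuery. I could not find any example suggesting that a translation for UNNEST is defined in the existing dbplyr package. (Reference 1, Reference 2)

One workaround is to define a custom function that, instead of translating within dbplyr, works over the top of dbplyr. I did something similar when I needed PIVOT in SQL but could not find a translation for tidyr::spread.

This approach works because a dbplyr remote table is defined by two things: (1) the connection to the remote database, and (2) the code/query that returns the current view of the table. So whenever dbplyr translates R into BigQuery or SQL, it updates the second half of that definition.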
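To make the two halves concrete, here is a minimal sketch that uses dbplyr's simulated connection, so no database is contacted. The column names and the default table name `df` are placeholders of mine, not part of the question's data.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# A lazy table backed by a simulated connection (assumption: illustration only;
# a real BigQuery table would come from tbl(bigquery_connection, ...)).
lazy_tbl <- lazy_frame(id = 1, value = 2, con = simulate_dbi())

# Each dplyr verb only changes the query half of the definition.
filtered <- lazy_tbl %>% filter(value > 1)

con_part <- remote_con(filtered)    # half 1: the connection
query_part <- sql_render(filtered)  # half 2: the SQL that defines the table

print(query_part)
```

Because the query half is just SQL text, a custom function can wrap it in a further query and hand the result back to `tbl()`.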

We can take advantage of this with a custom function:

unnest <- function(input_tbl, select_columns, array_column, unnested_columns) {

  # extract the connection from the remote table
  db_connection <- input_tbl$src$con

  select_columns <- paste0(select_columns, collapse = ", ")
  unnested_columns <- paste0(paste0("un.", unnested_columns), collapse = ", ")

  # build the SQL unnest query as plain text;
  # "position" is the column alias created by WITH OFFSET, not an R variable
  sql_query <- paste0(
    "SELECT ", select_columns, ", position, ", unnested_columns, "\n",
    "FROM (\n",
    dbplyr::sql_render(input_tbl),
    "\n) AS src\n",
    "CROSS JOIN UNNEST(", array_column, ") AS un WITH OFFSET position"
  )

  # wrap the text in sql() so dbplyr uses it as-is
  return(dplyr::tbl(db_connection, dbplyr::sql(sql_query)))
}

Note that I'm a dbplyr user but not a BigQuery user, so my syntax above may not be perfect. I followed this question and this question for the syntax.
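Since the function needs a live connection, it can help to check the generated SQL text on its own first. Below is a minimal sketch using only base R; `build_unnest_sql` is a hypothetical helper name of mine (not part of dbplyr or bigrquery), and the inner query is passed in as a plain string rather than rendered from a remote table.

```r
# Hypothetical helper: builds only the SQL text for the unnest step,
# so it can be inspected without a BigQuery connection.
build_unnest_sql <- function(inner_sql, select_columns, array_column,
                             unnested_columns) {
  select_part <- paste(select_columns, collapse = ", ")
  # prefix each unnested field with the alias given to the unnested array
  unnested_part <- paste(paste0("un.", unnested_columns), collapse = ", ")
  paste0(
    "SELECT ", select_part, ", position, ", unnested_part, "\n",
    "FROM (\n", inner_sql, "\n) AS src\n",
    "CROSS JOIN UNNEST(", array_column, ") AS un WITH OFFSET position"
  )
}

example_sql <- build_unnest_sql(
  inner_sql = "SELECT * FROM table_name",
  select_columns = "ID",
  array_column = "array_col",
  unnested_columns = "list"
)
cat(example_sql, "\n")
```

Whether BigQuery accepts the resulting query still has to be checked against a real table; the sketch only confirms how the pieces are pasted together.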

Example usage:

remote_table = tbl(bigquery_connection, from = "table_name")
unnested_table = unnest(remote_table, "ID", "array_col", "list")

# check syntax of dbplyr query
unnested_table %>% show_query()
# if this is not a valid bigquery query then next command will error

# view top 10 rows
unnested_table %>% head(10)

If remote_table looks like:

ID ARRAY_COL
01 list = [a,b,c]
02 list = [d,e]
03 list = [q]

then unnested_table should look like:

ID POSITION un.list
01    0        a
01    1        b
01    2        c
02    0        d
02    1        e
03    0        q

and unnested_table %>% show_query() should look like:

<SQL>
SELECT ID, position, un.list
FROM (
    SELECT *
    FROM table_name
) AS src
CROSS JOIN UNNEST(array_col) AS un WITH OFFSET position

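Update to match the target query

I'm not aware of any function that dbplyr can readily translate into _TABLE_SUFFIX BETWEEN "20191101" AND "20191102", so you will have to handle this another way, perhaps by looping over a list of dates in R.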
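For a longer date range, the list of table names can be generated rather than typed by hand. This is a small base R sketch; the variable names are mine, and the dataset prefix is copied from the question.

```r
# Build one table name per day in a date range (base R only).
start_date <- as.Date("2019-11-01")
end_date <- as.Date("2019-11-02")

dates <- seq(start_date, end_date, by = "day")
suffixes <- format(dates, "%Y%m%d")  # same format as _TABLE_SUFFIX

table_names <- paste0(
  "bigquery-public-data.google_analytics_sample.ga_sessions_", suffixes
)
print(table_names)
```

Each name can then be passed to tbl() in turn, and the per-day results combined afterwards, for example with dplyr::union_all().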

The first step is to have dbplyr render the query as it stands before unnesting. Something like this:

for(date in c("20191101", "20191102")){
    table_name = paste0("bigquery-public-data.google_analytics_sample.ga_sessions_",date)

    remote_table = tbl(bigquery_connection, from = table_name)

    remote_table = remote_table %>%
        filter(! (geoNetwork.networkDomain %like% "%google%")) %>%
        select(fullVisitorId, visitId, date, visitStartTime, hits, geoNetwork.networkDomain) %>%
        distinct()
}

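Calling show_query(remote_table) should then produce something equivalent to the following. It won't be identical, though, because dbplyr does not write code the way a human does.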

SELECT DISTINCT fullVisitorId, visitId, date, visitStartTime, hits, geoNetwork.networkDomain
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20191101`
WHERE NOT(geoNetwork.networkDomain LIKE "%google%")

The second step is to call the custom unnest function:

remote_table = unnest(remote_table,
                      select_columns = c("fullVisitorId", "visitId", "date", "visitStartTime", "geoNetwork.networkDomain"),
                      array_column = "hits",
                      unnested_columns = c("page.pagePath", "time")
               )

Calling show_query(remote_table) should then produce the following:

SELECT fullVisitorId, visitId, date, visitStartTime, geoNetwork.networkDomain, position, un.page.pagePath, un.time
FROM (

the_query_from_the_first_step

) AS src
CROSS JOIN UNNEST(src.hits) AS un WITH OFFSET position

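This is probably as much help as I can offer, since I don't have a BigQuery environment to test this myself. You may have to tweak the custom unnest function so that it matches your context exactly. Hopefully the above is enough to get you started.

Answered 2019-12-16T20:57:43.667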

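As noted in the comments, the function provided by Simon.SA does not work (a valiant attempt, but written without familiarity with BigQuery).

I made some changes to create a function that can handle a single nested variable.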

library(magrittr)
library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(bigrquery)

bq_deauth()
bq_auth(email="your_email@domain.com")

bq_conn = dbConnect(
  bigquery(),
  project = "elite-magpie-257717",
  dataset = "test_dataset"
)

df = tibble(
  chr =   c(1,1,1,2,2,3),
  start = c(0, 10, 12, 0, 5, 1),
  end =   c(2, 11, 15, 1, 8, 3)
)

df %>%
  rowwise() %>% mutate(range = list(seq(start, end)))
#> # A tibble: 6 x 4
#> # Rowwise: 
#>     chr start   end range    
#>   <dbl> <dbl> <dbl> <list>   
#> 1     1     0     2 <int [3]>
#> 2     1    10    11 <int [2]>
#> 3     1    12    15 <int [4]>
#> 4     2     0     1 <int [2]>
#> 5     2     5     8 <int [4]>
#> 6     3     1     3 <int [3]>

df %>%
  rowwise() %>% mutate(range = list(seq(start, end))) %>%
  unnest(range)
#> # A tibble: 18 x 4
#>      chr start   end range
#>    <dbl> <dbl> <dbl> <int>
#>  1     1     0     2     0
#>  2     1     0     2     1
#>  3     1     0     2     2
#>  4     1    10    11    10
#>  5     1    10    11    11
#>  6     1    12    15    12
#>  7     1    12    15    13
#>  8     1    12    15    14
#>  9     1    12    15    15
#> 10     2     0     1     0
#> 11     2     0     1     1
#> 12     2     5     8     5
#> 13     2     5     8     6
#> 14     2     5     8     7
#> 15     2     5     8     8
#> 16     3     1     3     1
#> 17     3     1     3     2
#> 18     3     1     3     3

dbWriteTable(
  bq_conn,
  name = "test_dataset.range_test",
  value = df,
  overwrite = T
)

df_bq = tbl(bq_conn, "test_dataset.range_test")

df_bq %>%
  mutate(range = generate_array(start, end, 1))
#> # Source:   lazy query [?? x 4]
#> # Database: BigQueryConnection
#>     end start   chr range    
#>   <int> <int> <int> <list>   
#> 1     2     0     1 <dbl [3]>
#> 2    11    10     1 <dbl [2]>
#> 3    15    12     1 <dbl [4]>
#> 4     1     0     2 <dbl [2]>
#> 5     8     5     2 <dbl [4]>
#> 6     3     1     3 <dbl [3]>

df_bq %>%
  mutate(range = generate_array(start, end, 1)) %>%
  unnest_wider(range)
#> Error: `x` must be a vector, not a `tbl_BigQueryConnection/tbl_dbi/tbl_sql/tbl_lazy/tbl` object.


my_unnest = function(input_tbl, array_column)
{

  ### extract connection
  db_connection = input_tbl$src$con

  ### column names surrounded by `` and separated by commas
  all_cols =
    colnames(input_tbl) %>%
    sprintf("`%s`", .) %>%
    paste(., collapse=", ")

  ### Build sql string (note the space before FROM);
  ### str_replace_all flattens every newline, not just the first
  sql_string =
    paste0(
      "SELECT ", all_cols,
      " FROM (", dbplyr::sql_render(input_tbl), ") ",
      "CROSS JOIN UNNEST(`", array_column, "`) AS `", array_column, "`"
    ) %>%
    str_replace_all("\n", " ")

  ### Build query object
  sql_query = dbplyr::sql(sql_string)

  print(sql_query)

  ### return a lazy table defined by the new query
  return(dplyr::tbl(db_connection, sql_query))
}


df_bq %>%
  mutate(range = generate_array(start, end, 1)) %>%
  my_unnest("range")
#> <SQL> SELECT `end`, `start`, `chr`, `range` FROM (SELECT `end`, `start`, `chr`, generate_array(`start`, `end`, 1.0) AS `range` FROM `test_dataset.range_test`) CROSS JOIN UNNEST(`range`) AS `range`
#> # Source:   SQL [?? x 4]
#> # Database: BigQueryConnection
#>      end start   chr range
#>    <int> <int> <int> <dbl>
#>  1     2     0     1     0
#>  2     2     0     1     1
#>  3     2     0     1     2
#>  4    11    10     1    10
#>  5    11    10     1    11
#>  6    15    12     1    12
#>  7    15    12     1    13
#>  8    15    12     1    14
#>  9    15    12     1    15
#> 10     1     0     2     0
#> # ... with more rows

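Created on 2021-02-18 by the reprex package (v1.0.0)

Note that it is important to specify the dataset (not just the project) in the connection; otherwise it will throw an error about a missing dataset.

Also, if you name the function unnest, you will clobber tidyr::unnest, which you probably don't want to do.

Answered 2021-02-19T00:18:16.767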