
I tried two approaches with the bigrquery package, like this:

library(bigrquery)
library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "YOUR PROJECT ID HERE",
  dataset = "YOUR DATASET"
)
sql <- "YOUR LARGE QUERY HERE" # long query saved to a view; its SELECT goes here

test <- dbGetQuery(con, sql, n = 10000, max_pages = Inf)

tb <- bigrquery::bq_project_query(project, sql) # project = your project ID
bq_table_download(tb, max_results = 1000)

Both fail with the error "Error: Requested Resource Too Large to Return [responseTooLarge]". This is possibly a related question, but I am interested in any tool that gets the job done: I have already tried the solutions outlined there, and they failed as well.

How do I load large datasets from BigQuery into R?


4 Answers


As @hrbrmstr suggested, the documentation specifically mentions:

> #' @param page_size The number of rows returned per page. Make this smaller
> #'   if you have many fields or large records and you are seeing a
> #'   'responseTooLarge' error.
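
A minimal sketch of that advice (the table handle `tb` and the value 2000 are placeholders; shrink `page_size` until the error disappears):

```r
library(bigrquery)

# tb comes from an earlier bq_project_query() call.
# Lower page_size from the default until responseTooLarge goes away.
bq_table_download(tb, page_size = 2000)
```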

The documentation on r-project.org gives a different recommendation in its explanation of this function (page 13):

This retrieves rows in chunks of page_size. It is most suitable for results of smaller queries (<100 MB, say). For larger queries, it is better to export the results to a CSV file stored on Google Cloud and use the bq command line tool to download it locally.
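
That export route can be sketched with bigrquery's own extract helper (the project, dataset, table, and bucket names below are placeholders you must replace):

```r
library(bigrquery)

# Placeholders: substitute your own project, dataset, table, and bucket.
tb <- bq_table("your-project", "your_dataset", "your_table")

# Export the table to one or more CSV files in Cloud Storage.
bq_table_save(tb, "gs://your-bucket/export-*.csv")

# Then pull the files down locally, e.g. with the gsutil command line tool:
#   gsutil cp 'gs://your-bucket/export-*.csv' ./data/
```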

answered 2018-09-03T10:51:20.083

I see someone has created a way to make this easier. It involves some setup, but then you can download using the BigQuery Storage API, like this:

## Auth is done automagically using Application Default Credentials.
## Use the following command once to set it up :
## gcloud auth application-default login --billing-project={project}
library(bigrquerystorage)

# TODO(developer): Set the project_id variable.
# project_id <- 'your-project-id'
#
# The read session is created in this project. This project can be
# different from that which contains the table.

rows <- bqs_table_download(
  x = "bigquery-public-data:usa_names.usa_1910_current"
  , parent = project_id
  # , snapshot_time = Sys.time() # a POSIX time
  , selected_fields = c("name", "number", "state")
  , row_restriction = 'state = "WA"'
  # , as_tibble = TRUE # FALSE : arrow, TRUE : arrow->as.data.frame
)

sprintf("Got %d unique names in states: %s",
        length(unique(rows$name)),
        paste(unique(rows$state), collapse = " "))

# Overrides bigrquery::bq_table_download
library(bigrquery)
rows <- bigrquery::bq_table_download("bigquery-public-data.usa_names.usa_1910_current")
# Downloading 6,122,890 rows in 613 pages.
overload_bq_table_download(project_id)
rows <- bigrquery::bq_table_download("bigquery-public-data.usa_names.usa_1910_current")
# Streamed 6122890 rows in 5980 messages.
answered 2021-07-22T03:21:15.533

I'm also just getting started with BigQuery. I think it should go something like this.

The current bigrquery release can be installed from CRAN:

install.packages("bigrquery")

The latest development version can be installed from GitHub:

install.packages('devtools')
devtools::install_github("r-dbi/bigrquery")

Using the low-level API

library(bigrquery)
billing <- bq_test_project() # replace this with your project ID 
sql <- "SELECT year, month, day, weight_pounds FROM `publicdata.samples.natality`"

tb <- bq_project_query(billing, sql)
#> Auto-refreshing stale OAuth token.
bq_table_download(tb, max_results = 10)

DBI

library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)
con 
#> <BigQueryConnection>
#>   Dataset: publicdata.samples
#>   Billing: bigrquery-examples

dbListTables(con)
#> [1] "github_nested"   "github_timeline" "gsod"            "natality"       
#> [5] "shakespeare"     "trigrams"        "wikipedia"

dbGetQuery(con, sql, n = 10)

dplyr

library(dplyr)

natality <- tbl(con, "natality")

natality %>%
  select(year, month, day, weight_pounds) %>% 
  head(10) %>%
  collect()
answered 2018-09-10T02:56:03.277

This worked for me:

# Make page_size some value greater than the default (10000)
x <- 50000

bq_table_download(tb, page_size=x)

Note that if you set page_size to an arbitrarily high value (100000 in my case), you will start to see a lot of empty rows.

I still haven't found a good rule of thumb for what the right page_size value is for a given table size.
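
For what it's worth, one rough heuristic (my own assumption, not documented behaviour) is to size pages against an approximate ~10 MB per-response cap, estimating bytes per row from the table's metadata:

```r
# Assumed per-response cap (~10 MB) and example table stats; in practice
# read numBytes / numRows for your table from bigrquery::bq_table_meta().
approx_cap  <- 10 * 1024^2
table_bytes <- 6.5e9
table_rows  <- 6122890

bytes_per_row <- table_bytes / table_rows
page_size     <- floor(approx_cap / bytes_per_row / 2)  # halve for headroom
page_size
```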

answered 2020-11-18T09:20:43.567