python - bq.py not paging results

Question

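We're writing a wrapper for bq.py and are having problems with result sets larger than 100k rows. This seems to have worked fine in the past (we ran into a related problem in "Google BigQuery Incomplete Query Replies on Odd Attempts"). Perhaps I'm misunderstanding the limits explained on the doc page?

For instance: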

#!/bin/bash

for i in `seq 99999 100002`;
do
    bq query -q --nouse_cache --max_rows 99999999 "SELECT id, FROM [publicdata:samples.wikipedia] LIMIT $i" > $i.txt
    j=$(cat $i.txt | wc -l)
    echo "Limit $i Returned $j Rows"
done

Yields (note that 4 of the lines in each file are formatting):

Limit 99999 Returned   100003 Rows
Limit 100000 Returned   100004 Rows
Limit 100001 Returned   100004 Rows
Limit 100002 Returned   100004 Rows
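
Subtracting the 4 formatting lines from each count gives the number of data rows actually returned. A minimal sketch of that arithmetic, using the counts above (the names `line_counts` and `FORMATTING_LINES` are illustrative only, not part of bq):

```python
# Line counts reported by `wc -l` for each LIMIT, copied from the output above.
line_counts = {99999: 100003, 100000: 100004, 100001: 100004, 100002: 100004}

# Each output file contains 4 non-data formatting lines (per the note above).
FORMATTING_LINES = 4

data_rows = {limit: lines - FORMATTING_LINES for limit, lines in line_counts.items()}

for limit, rows in data_rows.items():
    status = "complete" if rows == limit else "truncated"
    print(f"LIMIT {limit}: {rows} data rows ({status})")
```

In other words, every query with a LIMIT above 100,000 comes back with exactly 100,000 data rows.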

In our wrapper, we access the API directly:

while row_count < total_rows:
    data = client.apiclient.tabledata().list(maxResults=total_rows - row_count,
                                                 pageToken=page_token,
                                                 **table_dict).execute()

    # If there are more results than will fit on a page,
    # you will receive a token for the next page
    page_token = data.get('pageToken', None)

    # How many rows are there across all pages?
    total_rows = min(total_rows, int(data['totalRows'])) # Changed to use get(data[rows],0)
    raw_page = data.get('rows', [])

In this case we would expect to get a token, but none is returned.
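
For reference, the paging pattern we are aiming for looks roughly like the sketch below. It is not our actual wrapper: `list_page` is a hypothetical stand-in for the `tabledata().list(...).execute()` call, and `fake_list_page` is an invented in-memory server that exists only so the loop can run without a BigQuery connection. The loop keeps requesting pages until it has enough rows or the response carries no `pageToken`.

```python
def fetch_all_rows(list_page, total_rows):
    """Collect up to total_rows rows by following pageToken across pages.

    list_page is a hypothetical callable (max_results, page_token) -> dict
    shaped like a tabledata.list response: 'rows', 'totalRows', 'pageToken'.
    """
    rows = []
    page_token = None
    while len(rows) < total_rows:
        data = list_page(total_rows - len(rows), page_token)
        # Never ask for more rows than the table holds.
        total_rows = min(total_rows, int(data['totalRows']))
        page = data.get('rows', [])
        rows.extend(page)
        page_token = data.get('pageToken')
        # Stop when the server gives no token (or an empty page); without
        # a token there is no way to request the remaining rows.
        if not page or page_token is None:
            break
    return rows[:total_rows]


def fake_list_page(max_results, page_token, table_size=25, page_size=10):
    """Stand-in server for illustration: pages of at most page_size rows."""
    start = int(page_token or 0)
    end = min(start + min(max_results, page_size), table_size)
    response = {'totalRows': str(table_size),
                'rows': [{'f': [{'v': str(i)}]} for i in range(start, end)]}
    if end < table_size:
        response['pageToken'] = str(end)
    return response


print(len(fetch_all_rows(fake_list_page, 100)))
```

With a loop like this, a missing token on a short page means the remaining rows are silently lost, which matches what we are seeing.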

score 1 · Accepted Answer

I can reproduce the behavior you're seeing with the bq command line. This looks like a bug, and I'll see what I can do to fix it.

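One thing I did notice about the data you're querying: selecting only the id field and capping the number of rows at around 100,000 produces only about 1 MB of data, so the server would likely not paginate the results. Selecting a larger amount of data will force the server to paginate, since it cannot return all the results in a single response. If you did a select * for 100,000 rows of samples.wikipedia, you'd get roughly 50 MB back, which should be enough to start seeing some pagination.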
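
The two figures above imply rough per-row sizes, which can be checked with a quick calculation. This is only back-of-envelope arithmetic from the numbers in this answer (assuming megabytes of 10^6 bytes); it says nothing about the actual threshold at which the server starts paging, and `bytes_per_row` is a name made up for this illustration.

```python
ROWS = 100_000

def bytes_per_row(total_megabytes, rows=ROWS):
    """Average row size implied by a total response size in MB (10**6 bytes)."""
    return total_megabytes * 1_000_000 / rows

print(bytes_per_row(1))    # selecting only id
print(bytes_per_row(50))   # selecting *
```

That works out to about 10 bytes per row for the id-only query versus about 500 bytes per row for select *.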

Are you also seeing too few results come back from the python client, or were you just surprised that no page_token was returned for your samples.wikipedia query?

score 1 · Accepted Answer

Sorry it took me a little while to get back to you.

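I was able to identify a bug on the server side; you would end up seeing it with the Java client as well as the python client. We're planning to push out a fix this coming week. Your client should start behaving correctly as soon as that happens.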

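BTW, I'm not sure whether you knew this already, but there is a complete standalone python client that you can also use to access the API from python. I thought it might be a bit more convenient for you than the client distributed as part of bq.py. You'll find a link to it on this page: https://developers.google.com/bigquery/client-libraries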
