python - Looping through all BigQuery jobs with Python

Question

I am using the Google Python API to work with BigQuery.

I am trying to page through all the jobs in my project using jobs().list() and jobs().list_next(). I am using a generator with the following code:

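def list_all_jobs(service, project_id):
    # Optionally add one of these keyword arguments to list():
    #   maxResults=500, maxResults=1000 or maxResults=64000
    request = service.jobs().list(projectId=project_id,
                                  allUsers=True,
                                  stateFilter="done")
    # list_next() returns None once there are no more pages.
    while request is not None:
        response = request.execute()
        # Default to an empty list in case a page has no "jobs" key.
        for x in response.get("jobs", []):
            yield x
        request = service.jobs().list_next(request, response)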

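The problem is that I get a different list of jobs depending on how I use maxResults.

Without any maxResults parameter, I see 9986 jobs.
With maxResults=500 I see 8596 jobs.
With maxResults=1000 I see 6743 jobs.
With maxResults=64000 I see 6743 jobs.

I would expect the number of jobs to be the same every time, so I'm not sure whether I'm using the API correctly.

What is the correct way to loop through all the jobs in a project?

(Update, Wed Aug 14 15:30:29 CDT 2013)

Still trying to figure this out. I ran the code with different maxResults settings. Here are the number of jobs reported by each run and how the result sets relate to each other: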

s1 -> no maxResults
s2 -> maxResults=500
s3 -> maxResults=1000

|s1| -> 10112
|s2| -> 8579
|s3| -> 6556

|s1 intersection s2| -> 8578
|s2 difference s1| -> 1
|s1 difference s2| -> 1534

|s1 intersection s3| -> 6556
|s3 difference s1| -> 0
|s1 difference s3| -> 3556

|s3 intersection s2| -> 6398
|s2 difference s3| -> 2181
|s3 difference s2| -> 158

I still can't understand why I don't see a consistent total number of jobs regardless of which maxResults value I use.
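
The set comparison in the update can be reproduced with plain Python sets keyed on each job's jobId. This is only a minimal sketch: `job_ids` and `compare_job_sets` are hypothetical helper names, and it assumes each listed job is a dict with a `jobReference.jobId` field, as in the API responses shown in this thread.

```python
from itertools import combinations


def job_ids(jobs):
    """Collect the distinct job IDs from an iterable of job dicts."""
    return {job["jobReference"]["jobId"] for job in jobs}


def compare_job_sets(named_sets):
    """Return intersection and difference sizes for every pair of ID sets."""
    report = {}
    # Names are unique dict keys, so sorting the items only compares names.
    for (name_a, a), (name_b, b) in combinations(sorted(named_sets.items()), 2):
        report[(name_a, name_b)] = {
            "intersection": len(a & b),
            "only_in_" + name_a: len(a - b),
            "only_in_" + name_b: len(b - a),
        }
    return report
```

Comparing the length of the raw job list with `len(job_ids(...))` for the same run would also show whether a run returned the same job more than once.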

score 0 · Accepted Answer

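First, the [bigquery_client.py Python module][1] is a good way to access the API from Python. It is built on top of the raw client library and adds error handling, paging, and so on.

I'm not sure you are using the page tokens correctly. Can you confirm that you are checking nextPageToken? Here is an example I have used before: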

import httplib2
import pprint
import sys

from apiclient.discovery import build
from apiclient.errors import HttpError

from oauth2client.client import AccessTokenRefreshError
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage
from oauth2client.tools import run


# Enter your Google Developer Project number
PROJECT_NUMBER = 'XXXXXXXXXXXXX'

FLOW = flow_from_clientsecrets('client_secrets.json',
                               scope='https://www.googleapis.com/auth/bigquery')



def main():

  storage = Storage('bigquery_credentials.dat')
  credentials = storage.get()

  if credentials is None or credentials.invalid:
    credentials = run(FLOW, storage)

  http = httplib2.Http()
  http = credentials.authorize(http)

  bigquery_service = build('bigquery', 'v2', http=http)
  jobs = bigquery_service.jobs()

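  page_token = None
  count = 0

  while True:
    response = list_jobs_page(jobs, page_token)
    # Stop if the page request did not return a response.
    if response is None:
      break
    # A page may omit the 'jobs' key, so default to an empty list.
    for job in response.get('jobs', []):
      count += 1
      print('%d. %s\t%s\t%s' % (count,
                                job['jobReference']['jobId'],
                                job['state'],
                                job['errorResult']['reason'] if job.get('errorResult') else ''))
    # Keep paging until the response no longer carries a nextPageToken.
    if response.get('nextPageToken'):
      page_token = response['nextPageToken']
    else:
      break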




def list_jobs_page(jobs, page_token=None):
  try:
    jobs_list = jobs.list(projectId=PROJECT_NUMBER,
                          projection='minimal',
                          allUsers=True,
                          # You can set a custom maxResults here:
                          # maxResults=500,
                          pageToken=page_token).execute()

    return jobs_list

  except HttpError as err:
    # pprint.pprint() prints directly and returns None, so call it on its own.
    print('Error:')
    pprint.pprint(err.content)
    # Re-raise so the caller does not carry on with a missing page.
    raise



if __name__ == '__main__':
  main()
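
The nextPageToken loop above can also be written as a generator, which is closer to what the question was doing. The following is a rough sketch and not part of the client library: `iter_jobs` and its `fetch_page` argument are hypothetical names, and `fetch_page` is assumed to take a page token (or None for the first page) and return one decoded jobs.list response as a dict.

```python
def iter_jobs(fetch_page):
    """Yield every job, following nextPageToken until it is absent.

    fetch_page is a hypothetical callable: it receives a page token
    (None for the first page) and returns one response dict.
    """
    page_token = None
    while True:
        response = fetch_page(page_token)
        # A page may have no 'jobs' key at all, so default to an empty list.
        for job in response.get("jobs", []):
            yield job
        page_token = response.get("nextPageToken")
        if not page_token:
            break
```

With the service object from the example above, it could be driven by something like `iter_jobs(lambda token: jobs.list(projectId=PROJECT_NUMBER, allUsers=True, pageToken=token).execute())`. Keeping the HTTP call behind a callable also makes the paging logic easy to exercise with canned responses.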


  [1]: https://code.google.com/p/google-bigquery-tools/source/browse/bq/bigquery_client.py#1078
