
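I'm trying to use Google BigQuery to pull some data from the GitHub Archive. The amount of data I'm currently requesting is too large for BigQuery to handle (at least on the free tier), so I'm trying to narrow the scope of my request.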

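I want to restrict the data so that historical data is returned only for repositories that currently have more than 1000 stars. This is more complicated than just saying repository_watchers > 1000, because that would exclude the historical data for the first 1000 stars a repository received.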

SELECT repository_name, repository_owner, created_at, type, repository_url, repository_watchers
FROM [githubarchive:github.timeline]
WHERE type="WatchEvent"
ORDER BY created_at DESC

Edit: the solution I used (based on @Brian's answer)

select y.repository_name, y.repository_owner, y.created_at, y.type, y.repository_url, y.repository_watchers
  from [githubarchive:github.timeline] y
  join (select repository_url, max(repository_watchers)
          from [githubarchive:github.timeline] x
         where x.type = 'WatchEvent'
         group by repository_url
        having max(repository_watchers) > 1000) x
    on y.repository_url = x.repository_url
  where y.type = 'WatchEvent'
 order by y.repository_name, y.repository_owner, y.created_at desc
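
To check the difference between the naive filter and the join above without spending BigQuery quota, here is a minimal sketch on a toy table. It assumes SQLite (through Python's sqlite3 module) as a stand-in for BigQuery; the table name `timeline`, the example.com URLs, and all rows are made up for illustration, and only the column names come from the queries above.

```python
import sqlite3

# Toy stand-in for the timeline table; every row here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeline (repository_url TEXT, type TEXT, "
    "created_at TEXT, repository_watchers INTEGER)"
)
rows = [
    ("https://example.com/a/big", "WatchEvent", "2012-01-01", 10),
    ("https://example.com/a/big", "WatchEvent", "2013-01-01", 900),
    ("https://example.com/a/big", "WatchEvent", "2014-01-01", 1500),
    ("https://example.com/b/small", "WatchEvent", "2013-06-01", 40),
]
conn.executemany("INSERT INTO timeline VALUES (?, ?, ?, ?)", rows)

# Naive filter: keeps only events recorded after the repository passed 1000.
naive = conn.execute(
    "SELECT created_at FROM timeline "
    "WHERE type = 'WatchEvent' AND repository_watchers > 1000 "
    "ORDER BY created_at"
).fetchall()

# Join approach: select repositories by their peak watcher count,
# then keep every WatchEvent row that belongs to them.
wanted = conn.execute(
    "SELECT t.created_at FROM timeline t "
    "JOIN (SELECT repository_url FROM timeline "
    "      WHERE type = 'WatchEvent' "
    "      GROUP BY repository_url "
    "      HAVING MAX(repository_watchers) > 1000) popular "
    "  ON t.repository_url = popular.repository_url "
    "WHERE t.type = 'WatchEvent' "
    "ORDER BY t.created_at"
).fetchall()

print("naive rows:", len(naive), "joined rows:", len(wanted))
```

On this toy data the naive filter keeps a single event, while the join keeps the full history of the popular repository and still drops the small one.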

1 Answer


Try:

select y.*
  from [githubarchive:github.timeline] y
  join (select repository_name, max(repository_watchers)
          from [githubarchive:github.timeline]
         where type = 'WatchEvent'
         group by repository_name
        having max(repository_watchers) > 1000) x
    on y.repository_name = x.repository_name
 order by y.created_at desc

If that syntax is not supported, you can use a three-step solution, as follows:

Step 1: Find which REPOSITORY_NAME values have at least one record with a REPOSITORY_WATCHERS count > 1000

select repository_name, max(repository_watchers) as curr_watchers
  from [githubarchive:github.timeline]
 where type = 'WatchEvent'
 group by repository_name
having max(repository_watchers) > 1000

Step 2: Store that result as a table named SUB

Step 3: Run the following against SUB (and your original table)

select y.*
  from [githubarchive:github.timeline] y
  join sub x
    on y.repository_name = x.repository_name
 order by y.created_at desc
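
The three steps can be rehearsed locally before running them against the real dataset. The following is only a sketch under stated assumptions: SQLite (via Python's sqlite3 module) stands in for BigQuery, `CREATE TABLE ... AS` stands in for saving a query result as a table, and the `timeline` table and its rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for the timeline table; every row here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeline (repository_name TEXT, type TEXT, "
    "created_at TEXT, repository_watchers INTEGER)"
)
conn.executemany(
    "INSERT INTO timeline VALUES (?, ?, ?, ?)",
    [
        ("big", "WatchEvent", "2012-01-01", 10),
        ("big", "PushEvent", "2013-03-01", 700),
        ("big", "WatchEvent", "2014-01-01", 1500),
        ("small", "WatchEvent", "2013-06-01", 40),
    ],
)

# Steps 1 and 2: find the qualifying names and store them as table sub.
conn.execute(
    "CREATE TABLE sub AS "
    "SELECT repository_name, MAX(repository_watchers) AS curr_watchers "
    "FROM timeline "
    "WHERE type = 'WatchEvent' "
    "GROUP BY repository_name "
    "HAVING MAX(repository_watchers) > 1000"
)

# Step 3: join the original table against sub to recover the full history.
history = conn.execute(
    "SELECT y.repository_name, y.type, y.created_at, y.repository_watchers "
    "FROM timeline y "
    "JOIN sub x ON y.repository_name = x.repository_name "
    "ORDER BY y.created_at DESC"
).fetchall()

for row in history:
    print(row)
```

Note that step 3 as written returns every event type for the qualifying repositories, not just WatchEvent rows; add a `type = 'WatchEvent'` condition if only star events are wanted. Joining on repository_name alone may also mix together same-named repositories from different owners, which is presumably why the asker's final query joins on repository_url instead.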
answered 2014-07-27T23:06:02.550