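I'm trying to use Google BigQuery on the GitHub Archive (http://www.githubarchive.org/) data to get each repository's statistics as of its latest event, and I want this for the repositories with the most watchers. I realize this is a lot, but I feel like I'm really close to getting it in one query.

This is the query I have now: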

SELECT repository_name, repository_owner, repository_organization, repository_size,  repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000

The only problem is that I get every event from the most-watched repository (twitter bootstrap):

Results:

Row  repository_name  repository_owner  repository_organization  repository_size  watchers  forks  repository_language  time
1    bootstrap        twbs              twbs                     83875            61191     21602  JavaScript           1384991582000000
2    bootstrap        twbs              twbs                     83875            61190     21602  JavaScript           1384991337000000
3    bootstrap        twbs              twbs                     83875            61190     21603  JavaScript           1384989683000000

...

How can I get this to return a single result per repository_name (the most recent one, i.e. the row with MAX(time))?

I've tried:

SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) IN (SELECT MAX(PARSE_UTC_USEC(created_at)) FROM [githubarchive:github.timeline])
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000

I'm not sure whether this would work anyway, but it doesn't matter, because I get this error message:

Error: Join attribute is not defined: PARSE_UTC_USEC

Any help would be great, thanks.

1 Answer

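One issue with that query is that if two actions happen at the same time, your results could get mixed up. You can get what you want if you group by repository name alone to find the maximum event time for each repository, and then join against that result to pick up the other fields you want. For example: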

SELECT
  a.repository_name as name,
  a.repository_owner as owner,
  a.repository_organization as organization,
  a.repository_size as size,
  a.repository_watchers AS watchers,
  a.repository_forks AS forks,
  a.repository_language as language,
  PARSE_UTC_USEC(a.created_at) AS time
FROM [githubarchive:github.timeline] a
JOIN EACH
  (
     SELECT MAX(created_at) as max_created, repository_name 
     FROM [githubarchive:github.timeline]
     GROUP EACH BY repository_name
  ) b
  ON 
  b.max_created = a.created_at and
  b.repository_name = a.repository_name
ORDER BY watchers desc
LIMIT 1000  
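
To see the pattern in isolation, here is a minimal sketch that runs the same "group to find the max, then join back" query against a tiny in-memory SQLite table from Python. The table name `timeline`, the reduced set of columns, and all row values are made up for illustration; this is not BigQuery and does not use `JOIN EACH` / `GROUP EACH BY`, it only shows the shape of the join.

```python
import sqlite3

# Hypothetical miniature of the timeline table; all values are invented.
rows = [
    ("bootstrap", 61190, "2013-11-20 23:21:23"),
    ("bootstrap", 61190, "2013-11-20 23:48:57"),
    ("bootstrap", 61191, "2013-11-20 23:53:02"),
    ("jquery", 27000, "2013-11-20 22:00:00"),
    ("jquery", 27001, "2013-11-20 22:30:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeline ("
    "repository_name TEXT, repository_watchers INTEGER, created_at TEXT)"
)
conn.executemany("INSERT INTO timeline VALUES (?, ?, ?)", rows)

# Inner query: one row per repository with its latest created_at.
# Outer join: pull the remaining columns from the event at that timestamp.
query = """
SELECT a.repository_name, a.repository_watchers, a.created_at
FROM timeline a
JOIN (
    SELECT repository_name, MAX(created_at) AS max_created
    FROM timeline
    GROUP BY repository_name
) b
  ON b.repository_name = a.repository_name
 AND b.max_created = a.created_at
ORDER BY a.repository_watchers DESC
"""

latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

Each repository comes back once, with the watcher count from its most recent event. Note that if two events for the same repository share the exact same maximum `created_at`, the join returns both of them, so an extra tie-breaker would be needed in that case.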
answered 2013-11-21T02:13:02.677