github - Why doesn't the fork count in GitHub Archive on BigQuery match the UI?

Question

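I'm trying to get various metrics for GitHub repositories from GitHub Archive via BigQuery (documentation here). However, when I try to count the number of forks, the number I get is very different from the fork count shown in the GitHub UI. For example, when I run this SQL script: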

SELECT repo.url,repo.name , COUNT(*) fork_count, 
FROM [githubarchive:year.2011],
  [githubarchive:year.2012],
  [githubarchive:year.2013],
  [githubarchive:year.2014],
  [githubarchive:year.2015],
  [githubarchive:year.2016],
  [githubarchive:year.2017],
  [githubarchive:year.2018],
  [githubarchive:month.201901]
WHERE type='ForkEvent'
and repo.url like 'https://github.com/python/cpython'
GROUP BY 1,2

I get the following result:

Row  repo_url                           repo_name  fork_count
1    https://github.com/python/cpython  cpython    177

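However, when I go to the URL 'https://github.com/python/cpython', I see 8,198 forks. What is the reason for this discrepancy?

Edit:

Felipe points out below that the same repo may have multiple URLs.

However, even when I account for multiple URLs, the number still doesn't match the UI exactly, and this time it is much larger than the UI's number. Is there any way to get an exact match?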

score 1 · Accepted Answer

What are you querying? Note that you get different results depending on whether you select by repo id, name, or url:

#standardSQL
SELECT repo.name, repo.id, repo.url, COUNT(*) c
FROM `githubarchive.month.201*`
WHERE type='ForkEvent'
AND (
  repo.id = 81598961 
  OR repo.name='python/cpython'
  OR repo.url like 'https://github.com/python/cpython'
)
GROUP BY 1,2,3

If you want to know "when?":

#standardSQL
SELECT repo.name, repo.id, repo.url, COUNT(*) c
  , MIN(DATE(created_at)) since, MAX(DATE(created_at)) until
FROM `githubarchive.month.201*`
WHERE type='ForkEvent'
AND (
  repo.id = 81598961 
  OR repo.name='python/cpython'
  OR repo.url like 'https://github.com/python/cpython'
)
GROUP BY 1,2,3
ORDER BY since

Edit:

GitHub lists only one fork per user, so if you want to remove duplicates, use COUNT(DISTINCT actor.id), which brings the number down to about 9k.
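
As a minimal sketch of that deduplication, the query below reuses the filter from the queries above and reports the raw event count next to the distinct-user count. The column aliases fork_events and distinct_forkers are illustrative names, and the query assumes actor.id is populated for the ForkEvent rows in these tables. The result may still not match the UI exactly, for example if some forks were deleted after the event was recorded.

```sql
#standardSQL
-- Sketch: compare raw fork events with the number of distinct forking users.
-- Assumption: actor.id is populated for ForkEvent rows in these tables.
SELECT
  COUNT(*) AS fork_events,                      -- every recorded ForkEvent
  COUNT(DISTINCT actor.id) AS distinct_forkers  -- at most one per user
FROM `githubarchive.month.201*`
WHERE type = 'ForkEvent'
AND (
  repo.id = 81598961
  OR repo.name = 'python/cpython'
  OR repo.url LIKE 'https://github.com/python/cpython'
)
```

No GROUP BY is used here, so all id, name, and url variants of the repository collapse into a single row.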
