google-bigquery - Google BigQuery 加入每个错误

Question

我正在尝试运行一个连接太大数据集的简单查询，但我遇到了各种错误。此处转载的是使用公共数据库的类似查询

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] as gn1 inner join (select actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name from [publicdata:samples.github_nested] group by actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) as gn2 on gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

如果没有“inner join each”，则表示数据库太大。当我使用“inner join each”运行查询时，大查询会给出错误说明：

无法执行分区连接，因为 gn2 不可并行化：（SELECT [actor_attributes.blog]、[actor_attributes.company]、[actor_attributes.email]、[actor_attributes.gravatar_id]、[actor_attributes.location]、[actor_attributes.login]、[actor_attributes .name] FROM [publicdata:samples.github_nested] GROUP BY [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [ actor_attributes.name])

任何帮助将不胜感激

谢谢

score 1 · Accepted Answer

感谢您提供 [非] 有效的公开示例。使调试更容易。

重新格式化原始查询：

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1 
INNER JOIN EACH (
  SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) 
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

该查询确实因所述错误消息而失败。虽然错误消息可能会更好，但解决方案很简单：只需将 EACH 添加到子查询的 GROUP BY 中，以使其可并行化：

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1 
INNER JOIN EACH (
  SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP EACH BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) 
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

[查询完成（经过 12.7 秒，已处理 237 MB）]

google-bigquery - Google BigQuery 加入每个错误

1 回答 1

Related

Reference