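I have a table of about 5M rows. Note that this is just a POC; eventually we will need to be in the TB range. I am doing a self-join to find permutations of products for a market basket analysis.
I need to find the number of times a combination occurs in a basket, the ratio of those occurrences to the total number of baskets, and the number of times the item occurs across all baskets. This is pretty standard. BigQuery does not support a select within the predicate of another select, so I figure I need to create another join. This is what I came up with: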
select twoItem.upc1,twoItem.upc2,twoItem.twoItemOccurrences, totalUpc.totalUpcCount
from
(
select purchase1.upc as upc1,purchase2.upc as upc2,count(upc1) as twoItemOccurrences
from
conagra.purchase as purchase1
join each conagra.purchase as purchase2
on purchase1.upc = purchase2.upc
group by upc1,upc2
) as twoItem
JOIN EACH
(
select purchase3.upc as upc3, count(*) as totalUpcCount
from conagra.purchase as purchase3
group by upc3
) as totalUpc
on totalUpc.upc3 = twoItem.upc1
LIMIT 50;
I get the following error:
SHUFFLE BY can only be applied to parallelizable queries, but query is not parallelizable: (SELECT * FROM (SELECT [purchase3.upc] AS [upc3], COUNT(*) AS [totalUpcCount]...
Is this perhaps an unpublished limitation?
Any help would be appreciated.
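
To make the three quantities I am after concrete, here is a minimal Python sketch on a tiny made-up data set. It is not the BigQuery query, only an illustration of the intended result. The basket ids, the UPC values, and the function name basket_stats are all hypothetical, and it assumes each purchase row can be attributed to a basket.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: basket id -> set of UPCs (not taken from the real table).
baskets = {
    "b1": {"A", "B", "C"},
    "b2": {"A", "B"},
    "b3": {"B", "C"},
    "b4": {"A"},
}

def basket_stats(baskets):
    """Return rows of (upc1, upc2, pair_count, pair_ratio, upc1_count)."""
    total_baskets = len(baskets)
    item_counts = Counter()   # number of baskets containing each UPC
    pair_counts = Counter()   # number of baskets containing each ordered pair
    for items in baskets.values():
        item_counts.update(items)
        # Ordered pairs (permutations), matching what a self-join would produce.
        pair_counts.update(permutations(sorted(items), 2))
    rows = []
    for (upc1, upc2), n in sorted(pair_counts.items()):
        rows.append((upc1, upc2, n, n / total_baskets, item_counts[upc1]))
    return rows

for row in basket_stats(baskets):
    print(row)
```

Each output row corresponds to one (upc1, upc2) permutation: the pair count, that count divided by the total number of baskets, and the number of baskets containing upc1.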