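I have a table of about 5M rows. Note that this is only a POC; eventually we will need to be in the TB range. I am doing a self-join to find permutations of products for market basket analysis.

I need to find the number of times a combination occurs in a basket, the ratio of those occurrences to the total number of baskets, and the number of times the item occurs across all baskets. This is fairly standard. BigQuery does not support a select inside the predicate of another select, so I figured I need to create another join. This is what I came up with: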

select twoItem.upc1,twoItem.upc2,twoItem.twoItemOccurrences, totalUpc.totalUpcCount
from
(
    select purchase1.upc as upc1,purchase2.upc as upc2,count(upc1) as twoItemOccurrences
    from
    conagra.purchase as purchase1
    join each conagra.purchase as purchase2
    on purchase1.upc = purchase2.upc
    group by upc1,upc2
) as twoItem
JOIN EACH 
(
    select purchase3.upc as upc3, count(*) as totalUpcCount
    from conagra.purchase as purchase3
    group by upc3
) as totalUpc
on totalUpc.upc3 = twoItem.upc1
LIMIT 50;

I get the following error:

SHUFFLE BY can only be applied to parallelizable queries, but query is not parallelizable: (SELECT * FROM (SELECT [purchase3.upc] AS [upc3], COUNT(*) AS [totalUpcCount]...

Is this perhaps an undocumented limitation?

Any help would be appreciated.

1 Answer

Try using GROUP EACH BY on your inner queries. We will improve the error message for this kind of query.
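As a rough sketch of that suggestion, the query from the question might look like the following, with GROUP EACH BY in both inner queries. It is BigQuery legacy SQL. Table names, column names, and join conditions are copied unchanged from the question, and it has not been run against the original data, so treat it as an illustration rather than a verified fix.

```sql
-- Sketch only: the question's query with GROUP EACH BY applied to both
-- inner aggregations. Everything else is kept exactly as in the question.
select twoItem.upc1, twoItem.upc2, twoItem.twoItemOccurrences, totalUpc.totalUpcCount
from
(
    select purchase1.upc as upc1, purchase2.upc as upc2, count(upc1) as twoItemOccurrences
    from conagra.purchase as purchase1
    join each conagra.purchase as purchase2
    on purchase1.upc = purchase2.upc
    group each by upc1, upc2   -- was: group by upc1, upc2
) as twoItem
join each
(
    select purchase3.upc as upc3, count(*) as totalUpcCount
    from conagra.purchase as purchase3
    group each by upc3         -- was: group by upc3
) as totalUpc
on totalUpc.upc3 = twoItem.upc1
limit 50;
```

The error text suggests that JOIN EACH needs each joined subquery to be parallelizable, and a plain GROUP BY subquery apparently does not qualify, which is why only the two grouping clauses change here.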

Answered 2013-04-13T18:35:38.523