
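I have been looking into joins vs. subqueries in terms of resources and performance, and the answer seems to depend on the platform. For BigQuery specifically, though, nobody seems to discuss it.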

When I scaled my query up to 100 GB, I ran into:

Query Failed
Error: Resources exceeded during query execution.

My query is roughly:

#standardSQL
SELECT * FROM table t1 WHERE 
(t1.a in (SELECT b FROM anothertable WHERE class='value') 
OR t1.a in (SELECT c FROM table2) )

I'd like to know whether a JOIN would work better in BigQuery, especially if I scale up to terabytes of data.


2 Answers


Note the difference between this query and the next one:

1)

#standardSQL
SELECT COUNTIF(author IN (
   SELECT author 
   FROM `fh-bigquery.reddit_comments.2017_01`
))
FROM `fh-bigquery.reddit_comments.2017_01`

2)

#standardSQL
SELECT COUNTIF(author IN (
   SELECT DISTINCT author 
   FROM `fh-bigquery.reddit_comments.2017_01` 
))
FROM `fh-bigquery.reddit_comments.2017_01`

It's a silly query; both should return 157893170. Still, 1) has been running for more than 8 minutes (so far), while 2) ran in 36 seconds.

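The secret? When using IN(), make sure to add DISTINCT to remove duplicates. Otherwise there will be a lot of rows to JOIN that don't change the result at all.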
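Applied to the query in the question, that advice might look like the sketch below. The names `table`, `anothertable`, `table2` and the columns `a`, `b`, `c`, `class` are the placeholders from the question, and the sketch assumes `b` and `c` have the same type as `a`. The second statement is a JOIN form that should return the same rows, because the key set is deduplicated with UNION DISTINCT before joining and so cannot multiply rows of `t1`. Neither form has been benchmarked here.

```sql
#standardSQL
-- Run either statement on its own.

-- Option 1: keep IN(), but deduplicate each subquery.
-- `table`, anothertable and table2 are the question's placeholder names.
SELECT *
FROM `table` t1
WHERE t1.a IN (SELECT DISTINCT b FROM anothertable WHERE class = 'value')
   OR t1.a IN (SELECT DISTINCT c FROM table2);

-- Option 2: build one deduplicated key set, then JOIN against it.
-- Assumes b and c have the same type as a.
SELECT t1.*
FROM `table` t1
JOIN (
  SELECT b AS a FROM anothertable WHERE class = 'value'
  UNION DISTINCT
  SELECT c AS a FROM table2
) wanted_keys USING (a);
```

If the UNION DISTINCT were replaced with UNION ALL, a value present in both subqueries would make Option 2 return the matching `t1` rows twice, which is why the deduplication step matters for the JOIN form.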

// TODO(gcp): This could be a BigQuery optimization.
answered 2017-08-02T21:11:46.290

I wonder, have you tried Elliott's suggestion of using EXISTS?

Something like:

WITH table1 AS(
SELECT '1' as user, 1 AS id UNION ALL
SELECT '2' AS user, 2 as id UNION ALL
SELECT '3' AS user, 3 as id
),
anothertable AS(
SELECT '1' AS user, 'value' AS class , '4' AS c UNION ALL
SELECT '2' AS user, 'value2' AS class, '2' AS c UNION ALL
SELECT '4' AS user, 'value' AS class, '3' AS c UNION ALL
SELECT '5' AS user, 'value2' AS class, '5' as c
),
table2 AS(
SELECT '4' AS c UNION ALL
SELECT '2' AS c UNION ALL
SELECT '3' AS c UNION ALL
SELECT '5' as c
)

SELECT
  t1.*
FROM table1 t1
WHERE
  EXISTS(SELECT 1 FROM anothertable ta WHERE ta.class = 'value' AND t1.user = ta.user)
  OR EXISTS(SELECT 1 FROM table2 t2 WHERE t1.user = t2.c)

Does this still exceed resources?

answered 2017-08-03T12:59:36.460