sql - Avoiding a hash join with an IN clause and subquery in Spanner

Question

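I have the following query-optimization problem in Spanner, and I'm hoping there's a trick I'm missing that will help me bend the query planner to my will.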

Here's the simplified schema:

create table T0 (
  key0  int64 not null,
  value int64,
  other int64 not null,
) primary key (key0);

create table T1 (
  key1  int64 not null,
  other int64 not null
) primary key (key1);

And a query with a subquery in an IN clause:

select value from T0 t0
where t0.other in (
  select t1.other from T1 t1 where t1.key1 in (42, 43, 44)  -- note: this subquery is a good deal more complex than this
)

It produces a 10-element set via a hash join of T0 against the output of the subquery:

Operator                     Rows  Executions
-----------------------      ----- ----------
Serialize Result               10          1
Hash Join                      10          1
  Distributed union         10000          1
    Local distributed union 10000          1
    Table Scan: T0          10000          1
  Distributed cross apply:      5          1
   ...lots moar T1 subquery stuff...

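Note that although the subquery is complex, it actually produces a very small set. Unfortunately, the plan also scans the entirety of T0 (the 10,000-row Table Scan above) to feed the hash join, which is very slow.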

However, if I take the output of the subquery on T1 and manually shove it into the IN clause:

select value from T0
where other in (5, 6, 7, 8, 9)  -- presume this `IN` clause to be the output of the above subquery

It's dramatically faster, presumably because it hits T0's index only once per entry rather than hash joining against the full contents:

Operator                Rows Executions
----------------------- ---- ----------
Distributed union         10          1
Local distributed union   10          1
Serialize Result          10          1
Filter                    10          1
Index Scan:               10          1
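
The schema above doesn't include it, but the Index Scan in this plan suggests a secondary index on T0.other. A minimal sketch of what such an index might look like; the name T0ByOther and the STORING clause are assumptions for illustration, not taken from the real schema:

```sql
-- Hypothetical index: the name T0ByOther is an assumption.
-- STORING (value) keeps `value` in the index itself, so a lookup by
-- `other` may not need to join back to the base table T0.
CREATE INDEX T0ByOther ON T0 (other) STORING (value);
```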

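I could simply run two queries, which is my best plan so far. But I'm hoping to find some way to cajole Spanner into deciding that this is what it ought to do with the output of the subquery in the first example. I've tried everything I can think of, but this may simply not be expressible in SQL at all.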

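Also: I haven't quite proven this yet, but in some cases I fear the 10-element subquery output could blow up to a few thousand elements (T1 will grow more or less without bound, easily into the millions). I've manually tested a few hundred elements in the splatted-out IN clause and it seems to perform acceptably, but I'm a little concerned it could get out of hand.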
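
For the two-query approach, the second query doesn't have to be rebuilt with a literal list each time. A sketch, assuming the client passes the first query's results as an array parameter (the parameter name @others is hypothetical):

```sql
-- @others is a hypothetical ARRAY<INT64> query parameter holding the
-- values returned by the T1 query, for example [5, 6, 7, 8, 9].
SELECT value
FROM T0
WHERE other IN UNNEST(@others);
```

The query text stays the same whether the array holds ten values or a few thousand. Whether the plan stays an index lookup at larger sizes is not something this sketch guarantees, so it is worth checking the plan as the array grows.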

Note that I also tried a join on the subquery, like so:

select t0.other from T0 t0
join (
  -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
  -- enough that it can't be expressed that way.
  select t1.other from T1 t1 where t1.key1 = 42
) sub on sub.other = t0.other

But it did something truly horrifying in the query planner that I won't even try to explain here.

score 2 · Accepted Answer

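Does your actual subquery in the IN clause use any variables from T0? If not, what happens if you try the join query with the tables reordered (and add DISTINCT for correctness, unless you know the values will be distinct)?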

SELECT t0.other FROM (
  -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
  -- enough that it can't be expressed that way.
  SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key1 = 42
) sub
JOIN T0 t0
ON sub.other = t0.other
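
If the planner still picks a hash join for the reordered form, Spanner's join hints are one more thing to try. A sketch, assuming the simplified schema from the question; I haven't run this against your real subquery, so treat the hint choice as something to experiment with:

```sql
SELECT t0.other
FROM (
  -- Stand-in for the real, more complex subquery on T1.
  SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key1 = 42
) sub
-- FORCE_JOIN_ORDER keeps the small subquery on the driving side;
-- JOIN_METHOD=APPLY_JOIN asks for a per-row lookup into T0
-- instead of a hash join over a full scan of T0.
JOIN@{FORCE_JOIN_ORDER=TRUE, JOIN_METHOD=APPLY_JOIN} T0 t0
  ON sub.other = t0.other;
```

With an apply join, each row from the subquery drives a lookup into T0, which is the shape the hand-written IN list appears to get.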
