csv - Why does Apache Calcite estimate 100 rows for all the tables included in a query?

Question

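I recently tried to execute a query in Apache Calcite using three CSV files as tables:

TTLA_ONE contains 59 rows
TTLR_ONE contains 61390 rows
EMPTY_T contains 0 rows

This is the query that was executed: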

EXPLAIN PLAN FOR SELECT COUNT(*) as NUM 
FROM TTLA_ONE A 
INNER JOIN TTLR_ONE B1 ON A.X = B1.X
INNER JOIN TTLR_ONE B2 ON B2.X = B1.X
INNER JOIN EMPTY_T C1 ON C1.X = B2.Y
INNER JOIN EMPTY_T C2 ON C2.X = C2.X

The result of the query is always zero, because we are joining with an empty table. The plan obtained is:

EnumerableAggregate(group=[{}], NUM=[COUNT()])
  EnumerableJoin(condition=[=($1, $4)], joinType=[inner])
    EnumerableJoin(condition=[=($0, $1)], joinType=[inner])
      EnumerableInterpreter
        BindableTableScan(table=[[STYPES, TTLA_ONE]])
      EnumerableCalc(expr#0..1=[{inputs}], X=[$t0])
        EnumerableInterpreter
          BindableTableScan(table=[[STYPES, TTLR_ONE]])
    EnumerableJoin(condition=[=($1, $3)], joinType=[inner])
      EnumerableJoin(condition=[true], joinType=[inner])
        EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)], X=[$t0], $condition=[$t1])
          EnumerableInterpreter
            BindableTableScan(table=[[STYPES, EMPTY_T]])
        EnumerableInterpreter
          BindableTableScan(table=[[STYPES, EMPTY_T]])
      EnumerableInterpreter
        BindableTableScan(table=[[STYPES, TTLR_ONE]])

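Note that the empty tables are used last in the plan.

I added an example in this test code.

I dug deeper into the code and turned on debug logging, and I saw that the row count of every table is estimated as 100, which is not the case.

Below are the plan estimates taken from the logs with debug mode enabled: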

EnumerableJoin(condition=[=($1, $4)], joinType=[inner]): rowcount = 3.0375E7, cumulative cost = {3.075002214917643E7 rows, 950.0 cpu, 0.0 io}, id = 26284
  EnumerableJoin(condition=[=($0, $1)], joinType=[inner]): rowcount = 1500.0, cumulative cost = {2260.517018598809 rows, 400.0 cpu, 0.0 io}, id = 26267
    EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26260
      BindableTableScan(table=[[STYPES, TTLA_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7789
    EnumerableCalc(expr#0..1=[{inputs}], X=[$t0]): rowcount = 100.0, cumulative cost = {150.0 rows, 350.0 cpu, 0.0 io}, id = 26290
      EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26263
        BindableTableScan(table=[[STYPES, TTLR_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7791
  EnumerableJoin(condition=[=($1, $3)], joinType=[inner]): rowcount = 135000.0, cumulative cost = {226790.8015771949 rows, 550.0 cpu, 0.0 io}, id = 26282
    EnumerableJoin(condition=[true], joinType=[inner]): rowcount = 9000.0, cumulative cost = {9695.982870329724 rows, 500.0 cpu, 0.0 io}, id = 26277
      EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)], X=[$t0], $condition=[$t1]): rowcount = 90.0, cumulative cost = {140.0 rows, 450.0 cpu, 0.0 io}, id = 26288
        EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26270
          BindableTableScan(table=[[STYPES, EMPTY_T]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7787
      EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26275
        BindableTableScan(table=[[STYPES, EMPTY_T]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7787
    EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26280
      BindableTableScan(table=[[STYPES, TTLR_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7791

We can clearly see that the estimate for every table is always 100 (rowcount = 100.0).
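The join row counts in the log appear to follow directly from that default of 100. The sketch below is not Calcite code; it is plain arithmetic that assumes a selectivity of 0.15 for an equality condition and 0.9 for IS NOT NULL. Those two values are an assumption on my part (they are what I understand Calcite's default guesses to be), but they are consistent with the 90, 1500, 9000, 135000 and 3.0375E7 figures above. The class and method names are made up for illustration.

```java
public class RowCountEstimate {
    // Assumed default selectivities; chosen because they match the logged numbers.
    public static final double EQUALS_SELECTIVITY = 0.15;
    public static final double IS_NOT_NULL_SELECTIVITY = 0.9;

    /** Estimated rows of an inner join: left rows * right rows * condition selectivity. */
    public static double join(double left, double right, double selectivity) {
        return left * right * selectivity;
    }

    /** Estimated rows of the top-level join in the plan, given per-table row counts. */
    public static double topJoin(double ttlaOne, double ttlrOne, double emptyT) {
        // Left branch: TTLA_ONE joined with TTLR_ONE on equality.
        double aJoinB = join(ttlaOne, ttlrOne, EQUALS_SELECTIVITY);
        // Right branch: EMPTY_T filtered by IS NOT NULL, cross-joined with EMPTY_T,
        // then joined with TTLR_ONE on equality.
        double filteredEmpty = emptyT * IS_NOT_NULL_SELECTIVITY;
        double emptyCross = join(filteredEmpty, emptyT, 1.0); // condition=[true]
        double rightBranch = join(emptyCross, ttlrOne, EQUALS_SELECTIVITY);
        return join(aJoinB, rightBranch, EQUALS_SELECTIVITY);
    }

    public static void main(String[] args) {
        // With the default of 100 rows per table (about 3.0375E7, as in the log).
        System.out.println(topJoin(100, 100, 100));
        // With the real row counts the estimate collapses to zero.
        System.out.println(topJoin(59, 61390, 0));
    }
}
```

With the real cardinalities (59, 61390, 0) the same arithmetic gives zero rows for every join that involves EMPTY_T, which is the information the planner would need in order to prefer a different join order.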

The query executes correctly, but the plan is not optimized. Does anyone know why the table statistics are not evaluated correctly?

score 1 · Accepted Answer

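The answer here seems to be the same as for the question already linked in the comments.

Flink does not (yet) reorder joins

In the current version (1.7.1, Jan 2019), ... Calcite uses its default value, which is 100.

So the execution plan is not looking for tables with zero rows. In particular, I suspect from those answers that even if you reordered the tables in the FROM clause, it still would not notice.

In general, SQL optimization is driven by the availability of indexes and by the cardinality of the tables.

The only way to inject cardinality estimates for tables is through an ExternalCatalog.

Are you doing that?

If you are loading those tables as CSV files, have you declared keys, indexes, and whatever else the catalog needs?

It sounds like Calcite is not a mature product. If you are looking for a test bed to examine SQL optimization and query plans, use a different product.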

score 0 · Accepted Answer

The problem is that the class CsvTable has to override the getStatistic method, by doing something like this:

  private Statistic statistic;
  // todo: assign statistics

  @Override
  public Statistic getStatistic() {
    return statistic;
  }

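The statistics could be passed in through the constructor, or you could inject some object that generates them.

At the moment it just returns Statistics.UNKNOWN, the value from the implementation in the superclass AbstractTable. Of course, without statistics the estimated cost of the plan is not correct.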
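As a rough idea of what "some object that generates them" could look like, here is a minimal, Calcite-independent sketch that counts the data rows of a CSV file. It assumes the first line is a header (as in the CSV adapter's files) and that no field contains a line break. The class name CsvRowCounter is hypothetical. The resulting number could then be handed to the table's constructor and wrapped in a Statistic, for example with Statistics.of(rowCount, keys) if your Calcite version provides it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public final class CsvRowCounter {
    private CsvRowCounter() {}

    /** Counts data rows: skips the first (header) line and ignores blank lines. */
    public static long countDataRows(Stream<String> lines) {
        return lines.skip(1)
                    .filter(line -> !line.trim().isEmpty())
                    .count();
    }

    /** Same as above, reading the lines from a file (UTF-8). */
    public static long countDataRows(Path csvFile) {
        try (Stream<String> lines = Files.lines(csvFile)) {
            return countDataRows(lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A header plus two data rows.
        System.out.println(countDataRows(Stream.of("X:int,Y:int", "1,2", "3,4"))); // prints 2
        // A header only, like EMPTY_T.
        System.out.println(countDataRows(Stream.of("X:int")));                     // prints 0
    }
}
```

Counting rows means scanning the whole file once, so for large files it is probably worth caching the result rather than recomputing it on every call to getStatistic.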
