cassandra - Presto Cassandra connector clustering index

Question

CQL execution [returns immediately, presumably using the clustering key index]:

cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';

 count
-------
  5447

Presto execution [takes about 8 seconds]:

presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
  c   
------
 5447 
(1 row)

Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]

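Why is Presto processing 147K rows when Cassandra itself returns only 5447 rows for the same query [I tried select * too]?

Why can't Presto use the clustering key optimization?

I tried every value I could think of for the day predicate: timestamp, date, and different date formats. None of them had any effect on the number of rows fetched.

CF reference: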

CREATE TABLE events (
  month text,
  day timestamp,
  test_data text,
  some_random_column text,
  event_time timestamp,
  PRIMARY KEY (month, day, event_time)
)  WITH comment='Test Data'
AND read_repair_chance = 1.0;

I also added a constraint on event_time, in response to Dain's answer:

presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
 _col0 
-------
     1 
(1 row)

Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]

score 1 · Accepted Answer

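The Presto engine pushes simple WHERE clauses like this one down to the connector (you can see this in the Hive connector), so the question is why the Cassandra connector does not take advantage of it. To find out, we have to look at the code.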

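The pushdown system first interacts with a connector in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method. Looking at CassandraSplitManager, I see that it delegates the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) on every column in the primary key, so in your case you would need to add a constraint on event_time.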
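To make the described behavior concrete, here is a minimal Python sketch of an all-or-nothing check of that kind. It is only a model of the description above, not the connector's Java code: the function name can_push_down and the set-of-column-names representation are hypothetical, and the primary key is taken from the CREATE TABLE statement in the question.

```python
# Hypothetical model of the check described above; none of these names
# come from the Presto source.

# Primary key columns, in order, from the CREATE TABLE in the question.
PRIMARY_KEY = ["month", "day", "event_time"]


def can_push_down(constrained_columns, primary_key=PRIMARY_KEY):
    """Return True only if every primary key column has a constraint."""
    return all(column in constrained_columns for column in primary_key)


# First query in the question: only month and day are constrained.
print(can_push_down({"month", "day"}))

# Follow-up query: month, day and event_time are all constrained.
print(can_push_down({"month", "day", "event_time"}))
```

Under this model the first query is not pushed down and the follow-up query is. Note that the follow-up query in the question still reports 147K rows, so the sketch captures only this reading of the code, not necessarily everything the connector does.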

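I don't know why the code insists on a constraint for every column in the primary key, but I'd guess it is a bug. It should be easy to tweak this code to remove that restriction.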
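One possible shape for such a relaxed rule, sketched in Python under stated assumptions: the partition key must be fully constrained, and clustering columns are pushed down in declaration order until the first unconstrained one, which matches how CQL expects clustering columns to be restricted. The function pushable_columns and its arguments are hypothetical, every constraint is treated as a simple equality, and this is not taken from any actual Presto change.

```python
# Hypothetical relaxed rule; not the connector's actual logic.

def pushable_columns(constrained_columns, partition_key, clustering_key):
    """Return the key columns whose constraints could be pushed down.

    The partition key must be fully constrained. Clustering columns are
    then accepted in declaration order until the first gap.
    """
    if not all(column in constrained_columns for column in partition_key):
        return []  # without the full partition key, nothing is pushed down

    pushed = list(partition_key)
    for column in clustering_key:
        if column not in constrained_columns:
            break  # stop at the first unconstrained clustering column
        pushed.append(column)
    return pushed


# Key layout from the question: PRIMARY KEY (month, day, event_time).
partition_key = ["month"]
clustering_key = ["day", "event_time"]

# The first query constrains month and day only.
print(pushable_columns({"month", "day"}, partition_key, clustering_key))
```

With this rule the first query in the question would push down both month and day instead of falling back to scanning everything.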
