I have an external Hive table whose structure is basically:
CREATE EXTERNAL TABLE foo (time double, name string, value double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://node/foodir';
I created an index on (name, value):
CREATE INDEX idx ON TABLE foo(name, value)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX idx ON foo REBUILD;
My query is:
SELECT minute, count(minute) AS mincount
FROM (SELECT round(time/60) AS minute FROM foo WHERE name = 'Foo'
and value > 100000) t2 GROUP BY minute ORDER BY mincount DESC LIMIT 1;
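However, even though the rows satisfying the condition (name = 'Foo' and value > 100000) probably make up less than 0.1% of all rows, this Hive query still runs over the entire dataset, and it is about as fast as it would be on a table with no index.
Is there something wrong with the indexing scheme or with the query?
Output of running EXPLAIN SELECT ...: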
[rn14n21] out: OK
[rn14n21] out: ABSTRACT SYNTAX TREE:
[rn14n21] out: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME log))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTION round (/ (TOK_TABLE_OR_COL time) 60)) hour)) (TOK_WHERE (> (TOK_TABLE_OR_COL value) 1000000)))) t2)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL hour)) (TOK_SELEXPR (TOK_FUNCTION count (TOK_TABLE_OR_COL hour)) hrcount)) (TOK_GROUPBY (TOK_TABLE_OR_COL hour)) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (TOK_TABLE_OR_COL hrcount))) (TOK_LIMIT 3)))
[rn14n21] out:
[rn14n21] out: STAGE DEPENDENCIES:
[rn14n21] out: Stage-1 is a root stage
[rn14n21] out: Stage-2 depends on stages: Stage-1
[rn14n21] out: Stage-0 is a root stage
[rn14n21] out:
[rn14n21] out: STAGE PLANS:
[rn14n21] out: Stage: Stage-1
[rn14n21] out: Map Reduce
[rn14n21] out: Alias -> Map Operator Tree:
[rn14n21] out: t2:log
[rn14n21] out: TableScan
[rn14n21] out: alias: log
[rn14n21] out: Filter Operator
[rn14n21] out: predicate:
[rn14n21] out: expr: (value > 1000000.0)
[rn14n21] out: type: boolean
[rn14n21] out: Select Operator
[rn14n21] out: expressions:
[rn14n21] out: expr: round((time / 60))
[rn14n21] out: type: double
[rn14n21] out: outputColumnNames: _col0
[rn14n21] out: Group By Operator
[rn14n21] out: aggregations:
[rn14n21] out: expr: count(_col0)
[rn14n21] out: bucketGroup: false
[rn14n21] out: keys:
[rn14n21] out: expr: _col0
[rn14n21] out: type: double
[rn14n21] out: mode: hash
[rn14n21] out: outputColumnNames: _col0, _col1
[rn14n21] out: Reduce Output Operator
[rn14n21] out: key expressions:
[rn14n21] out: expr: _col0
[rn14n21] out: type: double
[rn14n21] out: sort order: +
[rn14n21] out: Map-reduce partition columns:
[rn14n21] out: expr: _col0
[rn14n21] out: type: double
[rn14n21] out: tag: -1
[rn14n21] out: value expressions:
[rn14n21] out: expr: _col1
[rn14n21] out: type: bigint
[rn14n21] out: Reduce Operator Tree:
[rn14n21] out: Group By Operator
[rn14n21] out: aggregations:
[rn14n21] out: expr: count(VALUE._col0)
[rn14n21] out: bucketGroup: false
[rn14n21] out: keys:
[rn14n21] out: expr: KEY._col0
[rn14n21] out: type: double
[rn14n21] out: mode: mergepartial
[rn14n21] out: outputColumnNames: _col0, _col1
[rn14n21] out: Select Operator
[rn14n21] out: expressions:
[rn14n21] out: expr: _col0
[rn14n21] out: type: double
[rn14n21] out: expr: _col1
[rn14n21] out: type: bigint
[rn14n21] out: outputColumnNames: _col0, _col1
[rn14n21] out: File Output Operator
[rn14n21] out: compressed: false
[rn14n21] out: GlobalTableId: 0
[rn14n21] out: table:
[rn14n21] out: input format: org.apache.hadoop.mapred.SequenceFileInputFormat
[rn14n21] out: output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
[rn14n21] out:
[rn14n21] out: Stage: Stage-2
[rn14n21] out: Map Reduce
[rn14n21] out: Alias -> Map Operator Tree:
[rn14n21] out: hdfs://rn14n21/tmp/hive-lei/hive_2013-09-12_21-19-33_247_861290513429832428/-mr-10002
[rn14n21] out: Reduce Output Operator
[rn14n21] out: key expressions:
[rn14n21] out: expr: _col1
[rn14n21] out: type: bigint
[rn14n21] out: sort order: -
[rn14n21] out: tag: -1
[rn14n21] out: value expressions:
[rn14n21] out: expr: _col0
[rn14n21] out: type: double
[rn14n21] out: expr: _col1
[rn14n21] out: type: bigint
[rn14n21] out: Reduce Operator Tree:
[rn14n21] out: Extract
[rn14n21] out: Limit
[rn14n21] out: File Output Operator
[rn14n21] out: compressed: false
[rn14n21] out: GlobalTableId: 0
[rn14n21] out: table:
[rn14n21] out: input format: org.apache.hadoop.mapred.TextInputFormat
[rn14n21] out: output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
[rn14n21] out:
[rn14n21] out: Stage: Stage-0
[rn14n21] out: Fetch Operator
[rn14n21] out: limit: 3
[rn14n21] out:
[rn14n21] out:
[rn14n21] out: Time taken: 11.284 seconds, Fetched: 99 row(s)
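One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the plan above shows an ordinary TableScan on the base table and no index-related stage, which is what one would expect if automatic index use is not enabled. In Hive versions that support indexes, `hive.optimize.index.filter` is off by default, and `hive.optimize.index.filter.compact.minsize` sets a minimum input size below which a compact index is not used. A minimal sketch of a session that turns these on and re-checks the plan, using the simplified table `foo` and the query from the post (the `0` threshold is only an illustrative value):

```sql
-- Assumption: the Hive version in use still supports indexes
-- and honors these two settings.
SET hive.optimize.index.filter=true;
-- Illustrative value: allow index use regardless of input size.
SET hive.optimize.index.filter.compact.minsize=0;

-- Re-run EXPLAIN and compare the plan with the one above.
EXPLAIN
SELECT minute, count(minute) AS mincount
FROM (SELECT round(time/60) AS minute FROM foo WHERE name = 'Foo'
      AND value > 100000) t2
GROUP BY minute ORDER BY mincount DESC LIMIT 1;
```

If the new plan is still a plain TableScan with no extra stage that reads the index table, the settings are probably not the whole story, and the index definition or the predicate types would be the next things to look at.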