总结:我觉得我的系统忽略了预排序表的概念。- 我希望节省排序步骤的时间,因为我使用的是预先排序的数据,但查询计划似乎表明中间排序步骤。
肮脏的细节如下:
设置 =======
我设置了以下标志:=============
set hive.enforce.bucketing = true;
set mapred.reduce.tasks=8;
set mapred.map.tasks=8;
在这里,我创建了一个表来保存磁盘上的数据的临时副本 ========
CREATE TABLE trades
(symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
这里我将磁盘上的数据复制到表 BTW 中,这里的数据按符号聚类并按时间排序。我似乎无法让 Hive 使用这个概念......即避免再次排序
LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
INTO TABLE trades
partition (dt='2010-05-07');
我使用下面的最终表来强制执行分桶 =========== 并强制排序 ===========
CREATE TABLE alltrades
(symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
数据从配置单元表加载 ==========
insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades
distribute by symbol sort by symbol, time;
令人失望的是,关于所有交易的任何查询都需要排序符号,时间会重新排序......有没有办法解决这个问题?另外,有没有办法让整个过程在 1 个查询步骤而不是 2 个查询步骤中工作?
为什么排序似乎不起作用=======
请注意,该表是使用 sort by 子句构建和填充的。我担心删除这些会导致未来的减速器表现得好像不需要排序一样。
这是我认为不应该涉及排序的查询计划......但实际上确实如此。========
hive> explain select symbol, time, price from alltrades sort by symbol, time;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
alltrades
TableScan
alias: alltrades
Select Operator
expressions:
expr: symbol
type: string
expr: time
type: string
expr: price
type: float
outputColumnNames: _col0, _col1, _col2
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
sort order: ++
tag: -1
value expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: float
Reduce Operator Tree:
Extract
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1