I am running Hive 0.14 on an HDP2 cluster. My dataset was built with the Kite SDK and is registered in Hive as an external table.
See the table layout below:
hive> describe hivetweets;
OK
created_at bigint from deserializer
id bigint from deserializer
in_reply_to_user_id bigint from deserializer
in_reply_to_status_id bigint from deserializer
lang string from deserializer
text string from deserializer
retweet_count int from deserializer
year int Partition column derived from 'created_at' column, generated by Kite.
month int Partition column derived from 'created_at' column, generated by Kite.
day int Partition column derived from 'created_at' column, generated by Kite.
hour int Partition column derived from 'created_at' column, generated by Kite.
# Partition Information
# col_name data_type comment
year int Partition column derived from 'created_at' column, generated by Kite.
month int Partition column derived from 'created_at' column, generated by Kite.
day int Partition column derived from 'created_at' column, generated by Kite.
hour int Partition column derived from 'created_at' column, generated by Kite.
Time taken: 0.15 seconds, Fetched: 19 row(s)
My initial test query against this setup fetches just one row of the dataset (I removed the actual output from the example):
hive> select * from hivetweets limit 1;
OK
Time taken: 103.726 seconds, Fetched: 1 row(s)
104 seconds is far too long for this query.
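One possible way to narrow this down, as a sketch rather than a confirmed fix: check how Hive plans the single-row query, and see whether restricting it to one partition changes the runtime. The partition values below are placeholders and not taken from the actual dataset. `hive.fetch.task.conversion` is the standard Hive setting that controls whether simple `select ... limit` queries are served by a fetch task instead of a cluster job. Whether it plays a role here is an assumption.

```sql
-- Print the current value of the fetch-task setting (does not change it).
set hive.fetch.task.conversion;

-- Show the plan Hive would use for the slow single-row query.
explain select * from hivetweets limit 1;

-- The same query restricted to a single partition.
-- year/month/day/hour values are hypothetical placeholders.
select * from hivetweets
where year = 2015 and month = 7 and day = 15 and hour = 13
limit 1;
```

If the partition-restricted query returns much faster, that would suggest most of the time goes into listing or scanning partitions rather than reading the row itself.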
It was probably not running distributed, so I tried testing it with more data:
hive> select count(*) from hivetweets limit 100000;
Query ID = root_20150715132222_81e386ef-2990-4251-a61f-82ca8da4c48d
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1436910684121_0006)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 19 19 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 567.52 s
--------------------------------------------------------------------------------
OK
197371741
Counting 100,000 records in 10 minutes is reasonable.
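Note that the result above is 197371741, not 100000. In Hive, as in SQL generally, `LIMIT` applies to the rows the query returns, and `count(*)` returns a single row, so the whole table was counted. A minimal sketch of a query that counts at most 100,000 rows, using a subquery (the alias `t` is arbitrary):

```sql
-- The LIMIT sits inside the subquery, so it is applied before the aggregation
-- rather than to the one-row result of count(*).
select count(*) from (select * from hivetweets limit 100000) t;
```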
I would appreciate any suggestions on how to debug this.