I have roughly 35 GB (22 million rows) of web clickstream data in a DynamoDB table. Pulling data out by key works fine. I'm now trying to use Hive to compute aggregates over that data, and even the most basic things won't work.
My DynamoDB table is provisioned with a read throughput of 40. My EMR cluster has one m1.small master and three m1.large core nodes. In Hive I run the following:
SET dynamodb.throughput.read.percent=1.0;
CREATE EXTERNAL TABLE AntebellumHive (
  user_id string,
  session_time string,
  page_count string,
  custom_os string
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "AntebellumClickstream",
  "dynamodb.column.mapping" = "user_id:user_id,session_time:session_time,page_count:x-page-count,custom_os:x-custom-os"
);
SELECT count(*)
FROM AntebellumHive
WHERE session_time > "2012/08/14 11:48:00.210 -0400"
  AND session_time < "2012/08/14 12:48:00.210 -0400";
So I've mapped four columns (including the user_id hash key and the session_time range key, plus two other attributes), and I'm simply trying to count the rows in one hour of data, which should be a few hundred.
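For reference, here's the kind of minimal sanity-check query I'd expect to come back almost immediately if the table mapping itself is fine (just a sketch; so far I've mainly been testing the count above):

SELECT user_id, session_time
FROM AntebellumHive
LIMIT 10;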
Here's the output:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201212031719_0002, Tracking URL = http://ip-xxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201212031719_0002
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=x.x.x.x:9001 -kill job_201212031719_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-12-03 19:13:58,988 Stage-1 map = 0%, reduce = 0%
2012-12-03 19:14:59,415 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:00,423 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:01,435 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:02,441 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:04,227 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:05,233 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:06,255 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:07,263 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:08,269 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:09,275 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:10,290 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:11,296 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 4.5 sec
(IPs masked.) Every minute or so I pick up another second of cumulative CPU time, but the map percentage never moves off 0%, and the job never completes, even after 20+ minutes. I can definitely see activity in the monitoring graphs for both Dynamo and EMR while it runs.
What am I doing wrong? Thanks!
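In case it's relevant, here's my rough back-of-the-envelope math, assuming this count effectively turns into a full table scan and that each provisioned read unit covers roughly 4 KB per second (that's my understanding of DynamoDB read capacity; eventually consistent reads would roughly double the rate):

40 read units * 4 KB/s per unit   ≈ 160 KB/s scan rate
35 GB                             ≈ 36,700,000 KB
36,700,000 KB / 160 KB/s          ≈ 229,000 s ≈ 64 hours

So even if the scan were progressing correctly, I wouldn't expect it to finish in 20 minutes, but I don't know whether a full scan is actually what Hive is doing here or whether something else is stuck.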