
I have roughly 35 GB (22 million rows) of web clickstream data in a DynamoDB table. I can pull data out by key just fine. I'm now trying to use Hive to compute aggregates over that data, and even basic things aren't working.

My DynamoDB table is provisioned with a read throughput of 40. My EMR cluster has one m1.small master and three m1.large core nodes. In Hive, I do the following:

SET dynamodb.throughput.read.percent=1.0;

CREATE EXTERNAL TABLE AntebellumHive (user_id string, session_time string, page_count string, custom_os string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("dynamodb.table.name" = "AntebellumClickstream", 
"dynamodb.column.mapping" = "user_id:user_id,session_time:session_time,page_count:x-page-count,custom_os:x-custom-os"); 

select count(*)
from AntebellumHive
WHERE session_time > "2012/08/14 11:48:00.210 -0400"
  AND session_time < "2012/08/14 12:48:00.210 -0400";

So, I've mapped four columns (including the user_id key and the session_time range field, plus two other attributes). Then I'm just trying to count the rows in one hour of data, which should be a few hundred.

Here's the output:

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201212031719_0002, Tracking URL = http://ip-xxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201212031719_0002
Kill Command = /home/hadoop/bin/hadoop job  -Dmapred.job.tracker=x.x.x.x:9001 -kill job_201212031719_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-12-03 19:13:58,988 Stage-1 map = 0%,  reduce = 0%
2012-12-03 19:14:59,415 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:00,423 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:01,435 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:02,441 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:04,227 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:05,233 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:06,255 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:07,263 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:08,269 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:09,275 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:10,290 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec
2012-12-03 19:15:11,296 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 4.5 sec

(IPs masked.) Every minute or so I gain another second of CPU time, but the map % never climbs off zero, and even after 20+ minutes it never finishes. I can definitely see things happening in the Dynamo and EMR monitoring graphs.

What am I doing wrong? Thanks!


1 Answer


If I'm reading your post correctly, you have 35 GB of data and you're trying to read it with 40 read IOPS. 40 IOPS translates to roughly 40 KB/s for a scan. That means it would take about 254 hours to complete the query.
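To spell that estimate out (using the same rough 1 read unit ≈ 1 KB/s scan approximation as above):

35 GB ≈ 35 × 1024 × 1024 KB = 36,700,160 KB
36,700,160 KB ÷ 40 KB/s = 917,504 s ≈ 254.9 hours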

Hive only updates the query's progress percentage once one or more mappers finish. Since each mapper that gets created may take a very long time to run, you won't see Hive's progress update any time soon.

You can log in to the Hadoop UI on the master node and look at the Hadoop statistics. It will show you the status of the individual map tasks and some statistics about the data being read. See the documentation:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingtheHadoopUserInterface.html
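For example, a minimal sketch of reaching the JobTracker UI (which your job log shows on port 9100) via an SSH tunnel, assuming mykey.pem is the key pair used to launch the cluster and the master's public DNS name is substituted in (both are placeholders here):

# Forward local port 9100 to the JobTracker web UI on the EMR master node.
# mykey.pem and the hostname below are placeholders; substitute your own.
ssh -i mykey.pem -N -L 9100:localhost:9100 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# Then browse to http://localhost:9100/jobtracker.jsp and drill into the
# running job to see per-map-task status and counters.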

answered 2012-12-04T01:25:13.087