
I'm using the mongo-hadoop adapter to run map/reduce jobs. Everything works, except for the job launch time and overall runtime: even on a very small dataset, the map phase takes 13 seconds and the reduce phase 12 seconds. I have changed settings in mapred-site.xml and core-site.xml, but the map/reduce time stays roughly constant. Is there any way to reduce it? I also looked at the optimized Hadoop distribution from Hanborq, which uses a worker pool for faster job launch/setup. Is there an equivalent available elsewhere? The Hanborq distribution is not very active; it was last updated 4 months ago and is built on an older version of Hadoop.

Some of my settings are as follows. mapred-site.xml:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xms1g</value>
</property>
<property>
    <name>mapred.sort.avoidance</name>
    <value>true</value>
</property>
<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
</property>
<property>
    <name>mapreduce.tasktracker.outofband.heartbeat</name>
    <value>true</value>
</property>
<property>
    <name>mapred.compress.map.output</name>
    <value>false</value>
</property>

core-site.xml:

<property>
    <name>io.sort.mb</name>
    <value>300</value>
</property>
<property>
    <name>io.sort.factor</name>
    <value>100</value>
</property>

Any help would be greatly appreciated. Thanks in advance.


1 Answer


Part of the delay is due to heartbeats. TaskTrackers send heartbeats to the JobTracker to let it know they are alive, and as part of the heartbeat they also report how many open map and reduce slots they have. In response, the JT assigns work to the TT. This means that when you submit a job, a TT can only pick up tasks at the pace of the heartbeat (every 2 - 4 seconds, give or take). In addition, the JT (by default) assigns only one task per heartbeat. So if you have only one TT, you can only get 1 task assigned every 2 - 4 seconds, even if the TT has spare capacity.

So you can:

  1. Shorten the interval between heartbeats.

  2. Change the task scheduler so that it assigns more than one task per TaskTracker heartbeat, e.g. via the Fair Scheduler's mapred.fairscheduler.assignmultiple (see the sketch below).
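As a rough sketch of both points, the properties below would go in mapred-site.xml on the JobTracker node. Note that the minimum heartbeat interval property (shown here as mapreduce.jobtracker.heartbeat.interval.min) and its default differ between Hadoop versions and distributions, so treat that name and value as assumptions to verify against your release; the Fair Scheduler lines only take effect if you actually switch mapred.jobtracker.taskScheduler to the Fair Scheduler.

<!-- Lower the minimum TT-to-JT heartbeat interval, in milliseconds.
     Property name and default vary by Hadoop version/distribution. -->
<property>
    <name>mapreduce.jobtracker.heartbeat.interval.min</name>
    <value>300</value>
</property>

<!-- Use the Fair Scheduler and let it assign multiple tasks per heartbeat. -->
<property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
</property>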

answered 2013-04-18T07:01:49.787