spring - Wiring a Hadoop JobFactoryBean with multiple reducers on a single Hadoop node

Question

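What I want to achieve:

I have set up a Spring Batch job containing Hadoop tasks to process some larger files. To run the job with multiple reducers, I need to set the number of reducers with setNumReduceTasks. I am trying to set it through the JobFactoryBean.

My bean configuration in classpath:/META-INF/spring/batch-common.xml: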

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:p="http://www.springframework.org/schema/p"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="jobFactoryBean" class="org.springframework.data.hadoop.mapreduce.JobFactoryBean" p:numberReducers="5"/>
    <bean id="jobRepository" class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean" />
    <bean id="transactionManager" class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>
    <bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher" p:jobRepository-ref="jobRepository" />
</beans>

This XML is included via:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:context="http://www.springframework.org/schema/context"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.0.xsd">

    <context:property-placeholder location="classpath:batch.properties,classpath:hadoop.properties"
            ignore-resource-not-found="true" ignore-unresolvable="true" />


    <import resource="classpath:/META-INF/spring/batch-common.xml" />
    <import resource="classpath:/META-INF/spring/hadoop-context.xml" />
    <import resource="classpath:/META-INF/spring/sort-context.xml" />

</beans>

I obtain the beans in my jUnit test as follows:

    JobLauncher launcher = ctx.getBean(JobLauncher.class);
    Map<String, Job> jobs = ctx.getBeansOfType(Job.class);
    JobFactoryBean jfb = ctx.getBean(JobFactoryBean.class);

The jUnit test stops with the error:

No bean named '&jobFactoryBean' is defined

So the JobFactoryBean is not loaded, although the other beans load correctly and without errors.

Without the line

JobFactoryBean jfb = ctx.getBean(JobFactoryBean.class);

the project's tests run, but each job gets only one reducer.

The call

ctx.getBean("jobFactoryBean");

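returns a Hadoop Job. I expected to get the factory bean there...

To test this, I extended the Reducer's constructor to log every Reducer creation, so that I am notified whenever one is instantiated. So far I only get a single entry in the log.

I have two virtual machines, each with 2 allocated cores and 2 GB of RAM, and I am trying to sort a 75 MB file made up of several books from Project Gutenberg.

EDIT:

Another thing I tried was setting the number of reducers on the Hadoop job through its attribute, but with no effect.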

<job id="search-jobSherlockOk" input-path="${sherlock.input.path}"
    output-path="${sherlockOK.output.path}"
    mapper="com.romediusweiss.hadoopSort.mapReduce.SortMapperWords"
    reducer="com.romediusweiss.hadoopSort.mapReduce.SortBlockReducer"
    partitioner="com.romediusweiss.hadoopSort.mapReduce.SortPartitioner"
    number-reducers="2"
    validate-paths="false" />

The setting in mapreduce-site.xml on both nodes:

<property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>10</value>
</property>

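...and why:

I want to reproduce the example from the following blog post: http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

I need to test the Partitioner's behavior with several reducers, either on the same machine or in a fully distributed environment. The first approach would be easier.

P.S.: Could a user with higher reputation create a "spring-data-hadoop" tag? Thanks!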
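To illustrate why the reducer count matters for this test, here is a minimal stand-alone sketch. It uses no Hadoop classes; `PartitionSketch` and its methods are hypothetical names made up for illustration. It applies the same arithmetic as Hadoop's default HashPartitioner (hash code with the sign bit cleared, modulo the number of reduce tasks), which is an assumption about what a custom partitioner such as SortPartitioner would be compared against.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; it does not use any Hadoop classes.
class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the reducer count.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Group the keys by the reducer index they would be routed to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            int reducer = partition(key, numReduceTasks);
            byReducer.computeIfAbsent(reducer, r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("sherlock", "holmes", "watson", "baker", "street");
        // With one reducer every key lands in partition 0.
        System.out.println("1 reducer:  " + assign(keys, 1));
        // With two reducers the keys are spread over partitions 0 and 1.
        System.out.println("2 reducers: " + assign(keys, 2));
    }
}
```

With a single reducer the remainder is always 0, so a partitioner has nothing to decide; its behavior only becomes observable once the job really runs with two or more reduce tasks.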

score 1 · Accepted Answer

I answered this question on the Spring forums, where it was also posted (I recommend using the forums for Spring Data Hadoop questions).

The full answer is here http://forum.springsource.org/showthread.php?130500-Additional-Reducers , but in short, the number of reducers is driven by the number of input splits. See http://wiki.apache.org/hadoop/HowManyMapsAndReduces
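For orientation, the Hadoop documentation also gives a rule of thumb for choosing the reducer count: roughly 0.95 or 1.75 times (number of nodes × mapred.tasktracker.reduce.tasks.maximum). The sketch below only does that arithmetic; `ReducerCountSketch` and its method names are hypothetical, and the result is a starting point rather than a value Hadoop computes for you.

```java
// Hypothetical helper that applies the documented rule of thumb
// for the number of reduce tasks; it is not part of any Hadoop API.
class ReducerCountSketch {

    // Total reduce slots available in the cluster.
    static int reduceSlots(int nodes, int reduceTasksMaximumPerNode) {
        return nodes * reduceTasksMaximumPerNode;
    }

    // 0.95 x slots: all reducers can start at once when the maps finish.
    static int lowEstimate(int nodes, int reduceTasksMaximumPerNode) {
        return reduceSlots(nodes, reduceTasksMaximumPerNode) * 95 / 100;
    }

    // 1.75 x slots: faster nodes run a second wave of reducers.
    static int highEstimate(int nodes, int reduceTasksMaximumPerNode) {
        return reduceSlots(nodes, reduceTasksMaximumPerNode) * 175 / 100;
    }

    public static void main(String[] args) {
        // The setup from the question: 2 nodes, reduce.tasks.maximum = 10.
        System.out.println("low:  " + lowEstimate(2, 10));
        System.out.println("high: " + highEstimate(2, 10));
    }
}
```

For the two-node setup in the question the configured maximum of 10 slots per node is therefore far above the 2 or 5 reducers being requested, so the slot limit itself should not be what restricts the job to a single reducer.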
