java - Is it possible to run a map/reduce job on a Hadoop cluster with no input file?

Question

When I try to run a map/reduce job on a Hadoop cluster without specifying any input file, I get the following exception:

 java.io.IOException: No input paths specified in job

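Well, I can imagine cases where running a job without input files makes sense. Generating test files is one of them. Can Hadoop do this? If not, do you have any experience with generating files? Is there a better way than keeping a dummy file with a single record on the cluster to use as the input file for generation jobs?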

score 1 · Accepted Answer

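File paths only matter for inputs based on FileInputFormat (such as SequenceFileInputFormat). Input formats that read from HBase or a database do not read from files at all, so you can implement your own InputFormat and define your own behavior in getSplits, createRecordReader, and the RecordReader it returns. To see how this is done, look at the source code of the TextInputFormat class.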
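As a rough illustration of the approach described above, here is a minimal sketch of an InputFormat that reads no files and instead synthesizes row numbers. It uses the new API (org.apache.hadoop.mapreduce). The class names SyntheticInputFormat, SyntheticSplit, and SyntheticReader and the two configuration keys are invented for this example; they are not part of Hadoop.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/** Sketch: an InputFormat that touches no files and emits row numbers instead. */
public class SyntheticInputFormat extends InputFormat<LongWritable, NullWritable> {
    // Hypothetical configuration keys, made up for this example.
    public static final String NUM_SPLITS = "synthetic.input.splits";
    public static final String ROWS_PER_SPLIT = "synthetic.input.rows.per.split";

    /** A split that describes a range of row numbers rather than a region of a file. */
    public static class SyntheticSplit extends InputSplit implements Writable {
        private long firstRow;
        private long rowCount;

        public SyntheticSplit() { } // no-arg constructor needed for deserialization

        public SyntheticSplit(long firstRow, long rowCount) {
            this.firstRow = firstRow;
            this.rowCount = rowCount;
        }

        @Override public long getLength() { return rowCount; }
        @Override public String[] getLocations() { return new String[0]; } // no locality hints

        @Override public void write(DataOutput out) throws IOException {
            out.writeLong(firstRow);
            out.writeLong(rowCount);
        }

        @Override public void readFields(DataInput in) throws IOException {
            firstRow = in.readLong();
            rowCount = in.readLong();
        }
    }

    /** Emits one (row number, null) record for every row in the split. */
    public static class SyntheticReader extends RecordReader<LongWritable, NullWritable> {
        private long firstRow;
        private long rowCount;
        private long emitted;
        private final LongWritable key = new LongWritable();

        @Override public void initialize(InputSplit split, TaskAttemptContext context) {
            SyntheticSplit s = (SyntheticSplit) split;
            firstRow = s.firstRow;
            rowCount = s.rowCount;
        }

        @Override public boolean nextKeyValue() {
            if (emitted >= rowCount) {
                return false;
            }
            key.set(firstRow + emitted++);
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
        @Override public float getProgress() {
            return rowCount == 0 ? 1f : (float) emitted / rowCount;
        }
        @Override public void close() { }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        int numSplits = context.getConfiguration().getInt(NUM_SPLITS, 1);
        long rows = context.getConfiguration().getLong(ROWS_PER_SPLIT, 1000L);
        List<InputSplit> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new SyntheticSplit(i * rows, rows)); // one map task per split
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new SyntheticReader();
    }
}
```

In the driver, job.setInputFormatClass(SyntheticInputFormat.class) would select this format, and no input path needs to be set. Each mapper then receives row numbers as keys and can generate whatever test records it likes from them.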

score 0 · Accepted Answer

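I guess you are looking to test your map-reduce on a small data set, so in that case I would recommend the following.

Unit tests for Map-Reduce will solve your problem.

If you want to test your mapper/combiner/reducer against a single line of input from your file, the best approach is to write a unit test for each of them.

Sample code:
With a Java mocking framework you can run these test cases in your IDE.

Here I have used Mockito. MRUnit can be used as well; it also depends on Mockito (a Java mocking framework).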

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.junit.Test;
import org.mockito.Mockito;

// MyMapper and MyReducer are placeholders for your own classes. Their map/reduce
// methods are protected in the Hadoop base classes, so either declare the overrides
// public or put this test in the same package as the classes under test.
public class BoxPlotMapperTest {

    @Test
    public void validOutputTextMapper() throws IOException, InterruptedException {
        MyMapper mapper = new MyMapper(); // your Mapper object
        Text line = new Text("single line from input-file"); // a single input line from the file
        Mapper.Context context = Mockito.mock(Mapper.Context.class);
        mapper.map(null, line, context); // the key is not used by this mapper, so null is passed
        // Check that the mapper wrote exactly the expected key/value pair.
        Mockito.verify(context).write(new Text("your expected key-output"), new Text("your expected value-output"));
    }

    @Test
    public void validOutputTextReducer() throws IOException, InterruptedException {
        MyReducer reducer = new MyReducer(); // your Reducer object
        // A List is already an Iterable, so it can be passed to reduce() directly.
        List<Text> values = Arrays.asList(
                new Text("value1"), new Text("value2"), new Text("value3"), new Text("value4"));
        Reducer.Context context = Mockito.mock(Reducer.Context.class);
        reducer.reduce(new Text("key"), values, context);
        // Check that the reducer wrote exactly the expected key/value pair.
        Mockito.verify(context).write(new Text("your expected key-output"), new Text("your expected value-output"));
    }
}

score 0 · Accepted Answer

For unit testing MR jobs you can also use MRUnit. If you want to generate test data with Hadoop, I suggest having a look at the source code of Teragen.
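To give an idea of what an MRUnit test looks like, here is a small sketch using MapDriver from the new-API package org.apache.hadoop.mrunit.mapreduce. LineLengthMapper is a hypothetical mapper written only for this illustration; substitute your own mapper and expected output.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class LineLengthMapperTest {

    // Hypothetical mapper for illustration: emits (line, byte length of the line).
    public static class LineLengthMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new IntWritable(value.getLength()));
        }
    }

    @Test
    public void emitsLineWithItsLength() throws IOException {
        // The driver feeds one record to the mapper and compares what it emits
        // with the expected output; runTest() fails the test on a mismatch.
        MapDriver.newMapDriver(new LineLengthMapper())
                .withInput(new LongWritable(0), new Text("hello"))
                .withOutput(new Text("hello"), new IntWritable(5))
                .runTest();
    }
}
```

Compared with mocking the Context by hand, the driver takes care of the plumbing, which keeps the test focused on inputs and expected outputs.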

score 0 · Accepted Answer

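If you want to generate test files, why do you need Hadoop in the first place? Any kind of file that a mapreduce step takes as input can be created outside of the mapreduce step using the API specific to that file type, even HDFS files.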
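As a sketch of that idea, the following plain Java program writes a small tab-separated test file without involving Hadoop at all. The class name TestFileGenerator, the "id<TAB>value" record layout, and the value range are assumptions chosen for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

public class TestFileGenerator {

    /** Writes `rows` lines of the form "id<TAB>value" to `target`. */
    public static void generate(Path target, int rows, long seed) throws IOException {
        Random random = new Random(seed); // fixed seed keeps the test data reproducible
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (int i = 0; i < rows; i++) {
                writer.write(i + "\t" + random.nextInt(1000));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("mr-test-input", ".tsv");
        generate(target, 100, 42L);
        System.out.println("Wrote " + Files.readAllLines(target).size() + " lines to " + target);
    }
}
```

The resulting file can then be copied to the cluster, for example with hdfs dfs -put, and used as ordinary job input.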

score 0 · Accepted Answer

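I know I'm resurrecting an old thread, but no best answer was chosen, so I thought I'd throw this out there. I agree that MRUnit is good for a lot of things, but sometimes I just want to play around with some real data (especially for tests where I would need to mock it out to make it work in MRUnit). When that is my goal, I create a separate little job to test my ideas and use SleepInputFormat to basically lie to Hadoop and say there is input when really there is not. The old API provides an example here: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.22/mapreduce/src/test/mapred/org/apache/hadoop/mapreduce/SleepJob.java, and I converted the input format to the new API here: https://gist.github.com/keeganwitt/6053872.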
