java - Hadoop - How to collect text output without a value

Question

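I am working on a MapReduce job, and I would like to know whether I can emit a custom string to my output file. No counts, no other quantities, just a blob of text.

Here is the basic idea of what I have in mind: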

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // this map doesn't do very much
        String line = value.toString();
        word.set(line);
        // emit to map output
        output.collect(word,one);

        // but how do I do something like output.collect(word)?
        // because in my output file I want to control the text
        // this is intended to be a map-only job
    }
}

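Is this kind of thing possible? The goal is a map-only job that transforms data, using Hadoop's parallelism but not necessarily the whole MR framework. When I run this job, I get one output file in HDFS per mapper.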

$ hadoop fs -ls /Users/dwilliams/output
2013-09-15 09:54:23.875 java[3902:1703] Unable to load realm info from SCDynamicStore
Found 12 items
-rw-r--r--   1 dwilliams supergroup          0 2013-09-15 09:52 /Users/dwilliams/output/_SUCCESS
drwxr-xr-x   - dwilliams supergroup          0 2013-09-15 09:52 /Users/dwilliams/output/_logs
-rw-r--r--   1 dwilliams supergroup    7223469 2013-09-15 09:52 /Users/dwilliams/output/part-00000
-rw-r--r--   1 dwilliams supergroup    7225393 2013-09-15 09:52 /Users/dwilliams/output/part-00001
-rw-r--r--   1 dwilliams supergroup    7223560 2013-09-15 09:52 /Users/dwilliams/output/part-00002
-rw-r--r--   1 dwilliams supergroup    7222830 2013-09-15 09:52 /Users/dwilliams/output/part-00003
-rw-r--r--   1 dwilliams supergroup    7224602 2013-09-15 09:52 /Users/dwilliams/output/part-00004
-rw-r--r--   1 dwilliams supergroup    7225045 2013-09-15 09:52 /Users/dwilliams/output/part-00005
-rw-r--r--   1 dwilliams supergroup    7222759 2013-09-15 09:52 /Users/dwilliams/output/part-00006
-rw-r--r--   1 dwilliams supergroup    7223617 2013-09-15 09:52 /Users/dwilliams/output/part-00007
-rw-r--r--   1 dwilliams supergroup    7223181 2013-09-15 09:52 /Users/dwilliams/output/part-00008
-rw-r--r--   1 dwilliams supergroup    7223078 2013-09-15 09:52 /Users/dwilliams/output/part-00009

How do I get the results in a single file? Should I use the identity reducer?

score 4 · Accepted Answer

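1. To get the effect of output.collect(word), you can use the NullWritable class: declare NullWritable as the map output value type and call output.collect(word, NullWritable.get()) in your Mapper. Note that NullWritable is a singleton.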

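2. If you do not want multiple files, you can set the number of reducers to 1. This adds overhead, though, because a lot of data has to be shuffled across the network: the Reducer must fetch its input from the different machines where the Mappers ran, and all of the load then lands on a single machine. Still, if you want just one output file, you can certainly use a single Reducer. conf.setNumReduceTasks(1) should be enough to achieve that.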
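Both points can be combined into one job. The following is a minimal sketch against the old org.apache.hadoop.mapred API used in the question, not a definitive implementation. The class name TextOnlyJob, the helper configure, the job name, and the input/output paths are hypothetical names chosen for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical class name; a sketch of the approach described above.
public class TextOnlyJob {

    /** Emits each input line as the key, with NullWritable as a placeholder value. */
    public static class LineMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output, Reporter reporter)
                throws IOException {
            word.set(value.toString());
            // NullWritable.get() returns the shared singleton instance.
            output.collect(word, NullWritable.get());
        }
    }

    /** Builds the job configuration (hypothetical helper); 0 reducers gives a map-only job. */
    public static JobConf configure(String in, String out, int numReducers) {
        JobConf conf = new JobConf(TextOnlyJob.class);
        conf.setJobName("text-only");
        conf.setMapperClass(LineMapper.class);
        // Only used when numReducers > 0; passes records through unchanged.
        conf.setReducerClass(IdentityReducer.class);
        conf.setNumReduceTasks(numReducers);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(in));
        FileOutputFormat.setOutputPath(conf, new Path(out));
        return conf;
    }

    public static void main(String[] args) throws IOException {
        // One reducer, so the output directory holds a single part file.
        JobClient.runJob(configure(args[0], args[1], 1));
    }
}
```

With a single reducer the lines pass through the shuffle, so they arrive in the output file sorted by key rather than in input order. Passing 0 instead of 1 keeps the job map-only, at the cost of one part file per mapper.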

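A few small suggestions:

I would not recommend getmerge, because it copies the resulting file to the local FS, so you would have to copy it back to HDFS before you can use it further.
If possible, use the new API.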
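For comparison, here is a rough sketch of the same idea with the new org.apache.hadoop.mapreduce API, configured as a map-only job. It assumes Hadoop 2.x or later (for Job.getInstance); the class name TextOnlyJobNewApi, the helper configure, and the job name are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical class name; assumes Hadoop 2.x or later.
public class TextOnlyJobNewApi {

    /** Writes each input line back out as the key, with no value. */
    public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    /** Builds a map-only job (hypothetical helper). */
    public static Job configure(Configuration conf, String in, String out) throws IOException {
        Job job = Job.getInstance(conf, "text-only");
        job.setJarByClass(TextOnlyJobNewApi.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0); // map-only: no shuffle, one output file per mapper
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        return job;
    }

    public static void main(String[] args) throws Exception {
        Job job = configure(new Configuration(), args[0], args[1]);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Changing setNumReduceTasks(0) to setNumReduceTasks(1) would route everything through a single default reducer and yield one output file, with the shuffle overhead described above.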

score 0 · Accepted Answer

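For a map-only job, the number of output files equals the number of mappers. If the job has reducers, it equals the number of reducers. Either way, you can always run hadoop dfs -getmerge <hdfs output directory> <some file> to merge everything in the output directory into a single file.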

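You can write plain text files with TextOutputFormat, e.g. job.setOutputFormat(TextOutputFormat.class). Then change the map method above to use OutputCollector<NullWritable, Text> and output.collect(NullWritable.get(), new Text("some text")). This writes some text for every record. If you want tab-separated key/value pairs, change it to OutputCollector<Text, Text> and output.collect(new Text("key"), new Text("some text")), which prints key<tab>some text in the output.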
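To make the two output shapes concrete, here is a small plain-Java sketch that imitates how a text output format lays out one record. It is an illustration only and does not call Hadoop; the class TextLineFormat and the method formatRecord are hypothetical, a null argument stands in for a null or NullWritable key or value, and the tab separator is assumed to be the default.

```java
// Hypothetical illustration; does not use Hadoop classes.
public class TextLineFormat {

    /**
     * Returns the text of one output line, or null when both parts are
     * absent (such a record produces no line at all).
     */
    public static String formatRecord(String key, String value) {
        if (key == null && value == null) {
            return null;
        }
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key);
        }
        // The separator appears only when both key and value are present.
        if (key != null && value != null) {
            line.append('\t');
        }
        if (value != null) {
            line.append(value);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatRecord(null, "some text"));  // value only
        System.out.println(formatRecord("key", "some text")); // key, tab, value
        System.out.println(formatRecord("some text", null));  // key only
    }
}
```

The third call corresponds to the Text key with NullWritable value variant: the line contains only the key, with no trailing tab.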
