hadoop - Sending exact binary sequences with Hadoop Streaming

Question

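I have sets of binary files that I need to split (according to some logic) and distribute to the mappers. I am using Hadoop Streaming for this. The main problem is sending the exact binary chunks over the wire without altering them. It turns out that sending raw bytes is not trivial.

To better illustrate the problem, I wrote a very simple class implementing RecordReader that is supposed to read some bytes from the split and send them on. The binary data can contain anything (including newline characters). Here is what next() might look like: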

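import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.RecordReader;

public class MyRecordReader implements
        RecordReader<BytesWritable, BytesWritable> {
    ...
    public boolean next(BytesWritable key, BytesWritable ignore)
            throws IOException {
        ...

        // Build a fixed 8-byte test payload: 01 02 03 04 05 06 07 08.
        byte[] result = new byte[8];
        for (int i = 0; i < result.length; ++i)
            result[i] = (byte)(i+1);
        // Overwrite two bytes with newlines to show the payload can contain anything.
        result[3] = (byte)'\n';
        result[4] = (byte)'\n';

        // Only the key carries data; the value is ignored.
        key.set(result, 0, result.length);
        return true;
    }
}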

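In this scenario, every call to next() should write the following byte sequence to stdin: 01 02 03 0a 0a 06 07 08. If I use typed bytes (HADOOP-1722), the sequence should be prefixed with five bytes in total: the first byte is the type of the sequence (0 for bytes) and the other four bytes give its size. So the sequence should look exactly like this: 00 00 00 00 08 01 02 03 0a 0a 06 07 08.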
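As a rough illustration of the framing described above, the following sketch builds that 13-byte record by hand. The class `TypedBytesFrame` and its methods are hypothetical helpers written for this example, not part of Hadoop; the sketch assumes only what the question states (type code 0, then a 4-byte big-endian length, then the payload).

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not a Hadoop class: frames a payload as one typed-bytes "bytes" record.
class TypedBytesFrame {
    // Type code 0 marks a raw byte sequence, as stated in the question.
    static final byte TYPE_BYTES = 0;

    static byte[] encode(byte[] payload) {
        // ByteBuffer is big-endian by default, so putInt writes the size as 00 00 00 08.
        return ByteBuffer.allocate(1 + 4 + payload.length)
                .put(TYPE_BYTES)          // 1 byte: type code
                .putInt(payload.length)   // 4 bytes: payload size
                .put(payload)             // payload, unmodified
                .array();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, '\n', '\n', 6, 7, 8};
        // prints: 00 00 00 00 08 01 02 03 0a 0a 06 07 08
        System.out.println(toHex(encode(payload)));
    }
}
```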

I tested it with /bin/cat to verify the result, using the following command:

hadoop jar <streaming jar location>
  -libjars <my input format jar>
  -D stream.map.input=typedbytes
  -mapper /bin/cat
  -inputformat my.input.Format

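Using hexdump to inspect the incoming key, I got this: 00 00 00 00 08 01 02 03 09 0a 09 0a 06 07 08. As you can see, every 0a (newline) is now prefixed with 09 (tab), yet the typed bytes header still gives the (previously) correct information about the type and size of the byte sequence.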

This is a serious problem when writing mappers in other languages, because the bytes are altered on the way.
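To make the consequence concrete, here is a small sketch of what a mapper that trusts the typed bytes header would see when it parses the observed stream. `TypedBytesMisparse` is a hypothetical class written for this example; the input is the hexdump from the question, and the parsing rule (type byte, 4-byte length, payload) is the one described there.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration, not a Hadoop class: parse the observed stream the way a mapper would.
class TypedBytesMisparse {
    // Reads one record: 1 type byte, 4-byte big-endian length, then that many payload bytes.
    static byte[] readRecord(ByteBuffer in) {
        int type = in.get();
        if (type != 0) {
            throw new IllegalStateException("unexpected type code " + type);
        }
        byte[] payload = new byte[in.getInt()];
        in.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        // Observed bytes: 00 00 00 00 08 01 02 03 09 0a 09 0a 06 07 08
        byte[] observed = {0, 0, 0, 0, 8, 1, 2, 3, 9, 10, 9, 10, 6, 7, 8};
        ByteBuffer in = ByteBuffer.wrap(observed);
        byte[] payload = readRecord(in);
        // The header still says 8, so the payload ends early at 06.
        System.out.println(Arrays.toString(payload));
        // The trailing 07 08 would be misread as the start of the next record.
        System.out.println(in.remaining() + " bytes left over");
    }
}
```

The declared size no longer matches the data that follows it, so the stream cannot be parsed reliably from that point on.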

It seems there is no guarantee that the bytes will be sent exactly as they are, unless there is something else I am missing?

score 0 · Accepted Answer

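Thanks to a very helpful hint from the hadoop-user mailing list, I found a solution to this problem.

In short, we need to override the way Hadoop IO writes data to and reads data from the standard streams. To do this: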

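Extend InputWriter and OutputReader, and also provide your own InputFormat and OutputFormat, so that you have full control over how bytes are written to and read from the streams.
Extend the IdentifierResolver class to tell Hadoop to use your own InputWriter and OutputReader.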
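The core of such a writer/reader pair is simply a framing that never inserts or interprets delimiter bytes. The sketch below shows one possible framing using only java.io; `LengthPrefixedFraming` is a hypothetical stand-in, and the 4-byte length prefix is an assumption made for this example rather than what the answer's author used. In a real job the same logic would go into your subclasses of InputWriter and OutputReader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical stand-in for the body of a custom writer/reader pair; not a Hadoop class.
class LengthPrefixedFraming {
    // What a custom writer could send to the mapper: the size, then the raw bytes.
    static void writeRecord(DataOutput out, byte[] data) throws IOException {
        out.writeInt(data.length);
        out.write(data);
    }

    // What a custom reader could do with the mapper's output: read the size, then exactly that many bytes.
    static byte[] readRecord(DataInput in) throws IOException {
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        return data;
    }

    // Sends one record through an in-memory buffer that stands in for the pipe.
    static byte[] roundTrip(byte[] data) {
        try {
            ByteArrayOutputStream pipe = new ByteArrayOutputStream();
            writeRecord(new DataOutputStream(pipe), data);
            return readRecord(new DataInputStream(
                    new ByteArrayInputStream(pipe.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, '\n', '\n', 6, 7, 8};
        // prints: true (the newline bytes survive unchanged)
        System.out.println(Arrays.equals(data, roundTrip(data)));
    }
}
```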

Use your IdentifierResolver, InputFormat, and OutputFormat as follows:

hadoop jar <streaming jar location>
-D stream.io.identifier.resolver.class=my.own.CustomIdentifierResolver
-libjars <my input format jar>
-mapper /bin/cat
-inputformat my.own.CustomInputFormat
-outputformat my.own.CustomOutputFormat
<other options ...>

The patch provided in the (unmerged) feature MAPREDUCE-5018 is a great source on how to do this, and it can be customized to fit your own needs.
