hadoop - Do I have to iterate over image file paths as keys in nextKeyValue() of a custom RecordReader to read many image files?

Question

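I am trying to read images from HDFS. I have written my own custom ImageInputFormat and ImageRecordReader implementations.

In ImageRecordReader, in the nextKeyValue() method (part of the RecordReader API), do I have to write a for loop to read the images, e.g. for (Path path : paths), given that I specify a directory of images on HDFS as the input? Or will it read the images by itself, since the images are divided among the map tasks and each map task gets its own images?

I am a bit confused here. Do I have to use a for loop in initialize() or in nextKeyValue()? If so, where should it go: in initialize() or in nextKeyValue()? (See the method details linked above.)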

score 0 · Accepted Answer

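Why don't you just write a SequenceFile with <Text, BytesWritable> instead of implementing your own format?

Here is an example for some arbitrary images; you should store their paths in yourImagePaths: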

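import java.io.ByteArrayOutputStream;
import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// try / catch / finally omitted for brevity
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path output = new Path("/tmp/out.seq");

List<String> yourImagePaths = new LinkedList<>();
// TODO fill your image paths here

// Option-based factory (Hadoop 2+); replaces the deprecated Writer constructor
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(output),
    SequenceFile.Writer.keyClass(Text.class),
    SequenceFile.Writer.valueClass(BytesWritable.class));

for (String file : yourImagePaths) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // copyBytes(in, out, conf) closes both streams when it finishes
    IOUtils.copyBytes(fs.open(new Path(file)), out, conf);
    // key = image path, value = raw image bytes
    writer.append(new Text(file), new BytesWritable(out.toByteArray()));
}

writer.close();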

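Basically, it writes the path as the key (to identify your image) and the raw bytes of the image as the value.

Now you can read it in a Hadoop job, and it will be split automatically. You only need to declare that the input key is Text and the value is BytesWritable, and that SequenceFileInputFormat is used as the input format.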
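
As a rough sketch of what such a job could look like, here is a map-only job that reads /tmp/out.seq and emits each image's path together with its size in bytes. The names ImageSizeJob, ImageSizeMapper and imageSize, as well as the output directory /tmp/image-sizes, are made up for illustration, and the code assumes the org.apache.hadoop.mapreduce API from Hadoop 2 or later.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical job: emits (image path, image size in bytes) for every record.
class ImageSizeJob {

    // Hypothetical helper: number of valid bytes stored in the value.
    // getLength() is used because the backing array may be larger than the data.
    static int imageSize(BytesWritable image) {
        return image.getLength();
    }

    // map() is called once per (path, bytes) record; the framework iterates.
    public static class ImageSizeMapper
            extends Mapper<Text, BytesWritable, Text, IntWritable> {
        @Override
        protected void map(Text path, BytesWritable image, Context context)
                throws IOException, InterruptedException {
            context.write(path, new IntWritable(imageSize(image)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "image-sizes");
        job.setJarByClass(ImageSizeJob.class);

        // Input: the SequenceFile written earlier; keys are Text, values BytesWritable.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/out.seq"));

        job.setMapperClass(ImageSizeMapper.class);
        job.setNumReduceTasks(0); // map-only: mapper output is written directly
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Assumed output directory; it must not exist before the job runs.
        FileOutputFormat.setOutputPath(job, new Path("/tmp/image-sizes"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each map() call receives exactly one image. The record reader behind SequenceFileInputFormat walks through the records of a split, so there is no loop over paths anywhere in your own code.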
