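Why don't you just write a SequenceFile with <Text, BytesWritable> instead of implementing your own format?
Here is an example for some random images; you should store their paths in yourImagePaths: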
    // omitted try / catch and finally statements
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("/tmp/out.seq");
    List<String> yourImagePaths = new LinkedList<>();
    // TODO fill your image paths here
    // key = image path (Text), value = raw image bytes (BytesWritable)
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, output,
        Text.class, BytesWritable.class);
    for (String file : yourImagePaths) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      // this copyBytes overload closes both streams once the copy is done
      org.apache.hadoop.io.IOUtils.copyBytes(fs.open(new Path(file)), out, conf);
      writer.append(new Text(file), new BytesWritable(out.toByteArray()));
    }
    writer.close();
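Basically, it writes the path as the key (to identify your image) and the raw bytes of the image as the value.
Now you can read the file in a Hadoop job, and it will be split automatically. You only need to declare that the input key is Text, that the value is BytesWritable, and that SequenceFileInputFormat must be used.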
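As a rough sketch of the reading side, a map-only job over that file could look like the following. It assumes the newer org.apache.hadoop.mapreduce API (Hadoop 2 or later); the class names ImageSizeJob and ImageSizeMapper, the job name, and the output path /tmp/image-sizes are made up for illustration and are not part of the original answer. The mapper only emits each image's size in bytes, as a stand-in for whatever processing you actually need.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class, for illustration only.
public class ImageSizeJob {

    // Input types match what the writer appended: Text key, BytesWritable value.
    public static class ImageSizeMapper
            extends Mapper<Text, BytesWritable, Text, IntWritable> {
        @Override
        protected void map(Text path, BytesWritable image, Context context)
                throws IOException, InterruptedException {
            // getLength() is the number of valid bytes; the backing array
            // returned by getBytes() may be longer than that.
            context.write(path, new IntWritable(image.getLength()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "image-sizes"); // job name is arbitrary
        job.setJarByClass(ImageSizeJob.class);

        // Read the SequenceFile produced by the writer code; splitting is
        // handled by the input format.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/out.seq"));

        job.setMapperClass(ImageSizeMapper.class);
        job.setNumReduceTasks(0); // map-only: mapper output is written directly
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Assumed output location; it must not exist before the job runs.
        FileOutputFormat.setOutputPath(job, new Path("/tmp/image-sizes"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Inside map, image.copyBytes() gives you an exact-length copy of the image data if you need to hand it to a decoder.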