hadoop - Writing data from Elephant Bird that ProtobufPigLoader can read

Question

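For one of my projects, I want to analyze about 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the Elephant Bird library. However, it is not entirely clear to me how to write a file to HDFS so that the ProtobufPigLoader class can consume it.

This is what I have:

Pig script: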

  register ../fs-c/lib/*.jar; -- this includes the elephant bird library
  register ../fs-c/*.jar;
  raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');

Import tool (excerpt):

import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter
import de.pc2.dedup.fschunk.pig.PigProtocol
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Opens (and overwrites) the target file and wraps it in a block writer.
def getWriter(filenamePath: Path): ProtobufBlockWriter[PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true)
  new ProtobufBlockWriter[PigProtocol.File](os, classOf[PigProtocol.File])
}

val writer = getWriter(new Path(filename))
val builder = PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish() // write out any messages still buffered in the last block
writer.close()

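The import tool runs fine. I had a few problems with ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader does not work. The problem I have now is that DUMP raw_data; returns "Unable to open iterator for alias raw_data", and ILLUSTRATE raw_data; returns "No (valid) input data found!".

To me it looks as if ProtobufPigLoader cannot read data written by ProtobufBlockWriter. But what should I use instead? How do I write data to HDFS from an external tool so that ProtobufPigLoader can process it?

Alternative question: What should I use instead? How do I write fairly large objects to Hadoop so that I can consume them with Pig? The objects are not very complex, but they contain a large list of sub-objects (a repeated field in Protobuf).

I would like to avoid any text format or JSON because they are simply too large for my data. I expect they would inflate the data by a factor of 2 or 3 (lots of integers, lots of byte strings that I would need to encode as Base64).
I would also like to avoid normalizing the data so that the id of the main object is attached to each sub-object (this is what is done now), because this also increases space consumption and makes joins necessary in later processing.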
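
As a rough check of the size concern about text formats, here is a minimal Scala sketch that compares the binary and textual size of a single byte string and a single integer. It uses only the JDK. The object name TextInflation, its helper methods, the 20-byte sample array and the sample value 8192 are assumptions made for illustration and do not come from my actual data. It also ignores field names, quotes and delimiters, which a format such as JSON would add on top.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object TextInflation {
  // Bytes needed to store `raw` as Base64 text (padding included).
  def base64Size(raw: Array[Byte]): Int =
    Base64.getEncoder.encodeToString(raw).getBytes(StandardCharsets.US_ASCII).length

  // Bytes needed by a Protobuf-style base-128 varint for a non-negative value.
  def varintSize(value: Long): Int = {
    require(value >= 0, "this sketch only covers non-negative values")
    var v = value
    var n = 1
    while (v >= 0x80L) { v >>>= 7; n += 1 } // 7 payload bits per byte
    n
  }

  // Bytes needed to store the same value as decimal ASCII text.
  def decimalSize(value: Long): Int = value.toString.length

  def main(args: Array[String]): Unit = {
    val bytes = new Array[Byte](20) // hypothetical 20-byte byte-string field
    println(s"raw=${bytes.length} base64=${base64Size(bytes)}")
    val number = 8192L // hypothetical integer field
    println(s"varint=${varintSize(number)} decimal=${decimalSize(number)}")
  }
}
```

For these sample values the Base64 form needs 28 bytes instead of 20, and the decimal form needs 4 bytes instead of 2, before any per-field overhead of the text format is counted.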

Update:

I do not use generated Protobuf loader classes; I use the reflection-based loader.
The Protobuf classes are in a jar that is registered. DESCRIBE shows the types correctly.
