hadoop - 是否有可用于 lzo 压缩二进制数据的 Scalding 源？

Question

我正在使用Elephant Bird 的可拆分 LZO 压缩将序列化的 Thrift 记录写入文件。为了实现这一点，我正在使用他们的ThriftBlockWriter课程。然后，我的 Scalding 作业使用FixedPathLzoThrift源来处理记录。这一切都很好。问题是我仅限于单个 Thrift 类的记录。

我想开始使用RawBlockWriter而不是ThriftBlockWriter[MyThriftClass]. 因此，我的输入将是 LZO 压缩的原始字节数组，而不是 LZO 压缩的 Thrift 记录。我的问题是：我应该用什么代替FixedPathLzoThrift[MyThriftClass]？

“protocol-buffers”标签的解释：Elephant Bird 使用 Protocol BuffersSerializedBlock类来包装原始输入，如此处所示。

score 0 · Accepted Answer

我通过创建一个FixedPathLzoRaw类来代替FixedPathLzoThrift：

case class FixedPathLzoRaw(path: String*) extends FixedPathSource(path: _*) with LzoRaw

// Corresponds to LzoThrift trait
trait LzoRaw extends LocalTapSource with SingleMappable[Array[Byte]] with TypedSink[Array[Byte]] {
  override def setter[U <: Array[Byte]] = TupleSetter.asSubSetter[Array[Byte], U](TupleSetter.singleSetter[Array[Byte]])
  override def hdfsScheme = HadoopSchemeInstance((new LzoByteArrayScheme()).asInstanceOf[Scheme[_, _, _, _, _]])
}

hadoop - 是否有可用于 lzo 压缩二进制数据的 Scalding 源？

1 回答 1

Related

Reference