hadoop - Crushing small files in HDFS

Question

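We are running Spark 0.9.1 on Mesos 0.17 against CDH5. Until now we have been using the "mr1" flavor of the CDH series so that we can run the filecrush project on our small files. For various reasons, we would like the freedom to upgrade to MR-2.

Is there any tool outside of Hadoop's map/reduce that does this? The filecrush library we use today is not trivial, so translating the pattern to Spark does not look straightforward.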

score 0 · Accepted Answer

MR1 code usually works with few or no changes once it is recompiled against the MR2 libraries. Does that not work for filecrush? Recompiling is probably the most straightforward route.

You wouldn't translate this directly to Spark, but you can probably achieve a similar effect quite easily by reading a bunch of files and writing the result out with a different partitioning. You may just run into the same issues, though: Spark is going to use HDFS and its InputFormats to read your data into splits, and that is kinda where your problem is coming from to begin with.
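As a rough illustration of the "read many files, write with a different partitioning" idea, here is a minimal sketch. It is not what filecrush does internally. It assumes line-oriented text files, a PySpark version whose RDD exposes `textFile`, `coalesce`, and `saveAsTextFile`, and a 128 MB target output size. The helper names `target_partitions` and `crush_text_files` are made up for this example, and the caller is assumed to already know the total input size.

```python
# Assumption: 128 MB target per output file; tune to your HDFS block size.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def target_partitions(total_bytes, target_file_bytes=DEFAULT_TARGET_BYTES):
    """Return how many output files keep each one near target_file_bytes."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative / positive")
    # Integer ceiling division, with a floor of one output file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


def crush_text_files(sc, src_glob, dst_dir, total_bytes):
    """Hypothetical helper: rewrite many small text files as fewer large ones.

    sc is an existing SparkContext; src_glob and dst_dir are HDFS paths
    supplied by the caller. Returns the number of output partitions used.
    """
    n = target_partitions(total_bytes)
    # coalesce merges input partitions without a full shuffle, so the
    # number of files written is at most n.
    sc.textFile(src_glob).coalesce(n).saveAsTextFile(dst_dir)
    return n
```

Note that the read side still creates at least one split per small input file, which is the caveat raised above. Only the output ends up consolidated.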
