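Is it possible to run Hadoop MR jobs using Google's Dataflow service?
I have several Hadoop MR jobs that I would like to run on the Dataflow service. I want to take advantage of Dataflow without having to completely rewrite my Hadoop jobs.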
To make migration easier, I think it should be possible to define a generic Dataflow Transform which could wrap Hadoop Mappers and Reducers so the code could be reused in Dataflow Pipelines.
Here is a very minimal implementation, AvroMRTransform, that acts as a wrapper for AvroMapper and AvroReducer (i.e. it can only be used for inputs and outputs that are Avro data).
AvroMRTransform works but there are almost certainly cases it doesn't handle. It also doesn't support a bunch of Hadoop features such as counters.
For these reasons, I wouldn't recommend this as anything other than a temporary stopgap measure (e.g. your application contains many MR jobs and you don't want to rewrite them all at once).
The Hadoop MR API strikes me as being very large, so ultimately supporting every feature using Dataflow is probably going to be more work than just rewriting your application.
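To make the wrapping idea a bit more concrete, here is a rough plain-Java sketch of the three stages such a transform would have to chain: a per-element map step (what a ParDo would do), a grouping of values by key (what GroupByKey would do), and a per-key reduce step (another ParDo). This is not the AvroMRTransform code, and it calls neither the Dataflow SDK nor Hadoop. `SimpleMapper`, `SimpleReducer`, `apply` and `wordCount` are hypothetical names made up for illustration, and the grouping happens in memory on one machine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

final class MapReduceWrapperSketch {

    // Hypothetical stand-in for a Hadoop-style mapper: emits (key, value) pairs.
    interface SimpleMapper<I, K, V> {
        void map(I input, BiConsumer<K, V> collector);
    }

    // Hypothetical stand-in for a Hadoop-style reducer: sees all values for one key.
    interface SimpleReducer<K, V, O> {
        void reduce(K key, Iterable<V> values, Consumer<O> collector);
    }

    // Chains the stages a wrapper transform would need:
    // map each element, group emitted values by key, then reduce each group.
    static <I, K extends Comparable<K>, V, O> List<O> apply(
            List<I> inputs, SimpleMapper<I, K, V> mapper, SimpleReducer<K, V, O> reducer) {
        // Stage 1 + 2: run the mapper and collect its output grouped by key.
        // A TreeMap keeps keys sorted, loosely mimicking the MR shuffle ordering.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            mapper.map(input, (k, v) ->
                    grouped.computeIfAbsent(k, unused -> new ArrayList<>()).add(v));
        }
        // Stage 3: run the reducer once per key.
        List<O> outputs = new ArrayList<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            reducer.reduce(entry.getKey(), entry.getValue(), outputs::add);
        }
        return outputs;
    }

    // Example "job": the classic word count, written against the stand-in interfaces.
    static List<String> wordCount(List<String> lines) {
        SimpleMapper<String, String, Integer> mapper = (line, out) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.accept(word, 1);
                }
            }
        };
        SimpleReducer<String, Integer, String> reducer = (word, counts, out) -> {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            out.accept(word + "=" + sum);
        };
        return apply(lines, mapper, reducer);
    }

    public static void main(String[] args) {
        // prints [a=2, b=2, c=1]
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

In a real pipeline the mapper and reducer objects would be invoked from inside the pipeline's own processing functions rather than from a loop, and anything the Hadoop code expects from its context (counters, configuration, and so on) would need its own shim, which is where most of the extra work mentioned above would come from.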