
I am designing an application that requires a distributed set of processing workers that need to asynchronously consume and produce data in a specific flow. For example:

  • Component A fetches pages.
  • Component B analyzes pages from A.
  • Component C stores analyzed bits and pieces from B.

There are obviously more than just three components involved.

Further requirements:

  • Each component needs to be a separate process (or set of processes).
  • Producers don't know anything about their consumers. In other words, component A just produces data, not knowing which components consume that data.

This is the kind of data-flow problem solved by topology-oriented systems like Storm. While Storm looks good, I'm skeptical; it's a Java system and it's based on Thrift, neither of which I am a fan of.

I am currently leaning towards a pub/sub-style approach which uses AMQP as the data transport, with HTTP as the protocol for data sharing/storage. This means the AMQP queue model becomes a public API — in other words, a consumer needs to know which AMQP host and queue that the producer uses — which I'm not particularly happy about, but it might be worth the compromise.
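To make the decoupling concrete, here is a minimal in-memory sketch of the fanout idea behind that pub/sub approach. The `Broker` type and its methods are hypothetical stand-ins for an AMQP fanout exchange, not a real client library: the producer publishes to an exchange by name and never sees which queues are bound to it.

```go
package main

import "fmt"

// Broker is a hypothetical in-memory stand-in for an AMQP broker
// with fanout exchanges: producers publish to an exchange by name
// and never see the queues bound to it.
type Broker struct {
	bindings map[string][]chan string // exchange name -> bound queues
}

func NewBroker() *Broker {
	return &Broker{bindings: make(map[string][]chan string)}
}

// Subscribe binds a new queue to an exchange and returns it.
func (b *Broker) Subscribe(exchange string) <-chan string {
	q := make(chan string, 16)
	b.bindings[exchange] = append(b.bindings[exchange], q)
	return q
}

// Publish fans a message out to every queue bound to the exchange.
func (b *Broker) Publish(exchange, msg string) {
	for _, q := range b.bindings[exchange] {
		q <- msg
	}
}

func main() {
	b := NewBroker()
	analyzer := b.Subscribe("pages") // component B binds a queue
	archiver := b.Subscribe("pages") // some other consumer binds too

	b.Publish("pages", "http://example.com") // component A publishes blindly

	fmt.Println(<-analyzer, <-archiver)
}
```

With a real broker, `Subscribe` would be a queue declaration plus a binding, and the "public API" concern above amounts to both sides agreeing on the exchange name and message format.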

Another issue with the AMQP approach is that each component will need very similar logic for:

  • Connecting to the queue
  • Handling connection errors
  • Serializing/deserializing data into a common format
  • Running the actual workers (goroutines or forking subprocesses)
  • Dynamic scaling of workers
  • Fault tolerance
  • Node registration
  • Processing metrics
  • Queue throttling
  • Queue prioritization (some workers are less important than others)

…and many other little details that each component will need.

Even if a consumer is logically very simple (think MapReduce jobs, something like splitting text into tokens), there is a lot of boilerplate. Certainly I can do all this myself — I am very familiar with AMQP and queues and everything else — and wrap all this up in a common package shared by all the components, but then I am already on my way to inventing a framework.
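A sketch of what the shared part of such a package might look like, assuming a hypothetical common `Job` wire format and JSON serialization: the boilerplate (decoding, worker goroutines) lives in one `Consume` helper, and each component supplies only its handler.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Job is a hypothetical common wire format shared by all components.
type Job struct {
	URL  string `json:"url"`
	Body string `json:"body"`
}

// Consume is a sketch of the shared boilerplate: it decodes JSON
// messages from in and runs n worker goroutines over them, so a
// component only has to supply the handle function.
func Consume(in <-chan []byte, n int, handle func(Job)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range in {
				var j Job
				if err := json.Unmarshal(raw, &j); err != nil {
					continue // real code would record the error
				}
				handle(j)
			}
		}()
	}
	wg.Wait()
}

func main() {
	in := make(chan []byte, 1)
	in <- []byte(`{"url":"a","body":"hello world"}`)
	close(in)

	// A logically trivial consumer: the handler is all it defines.
	Consume(in, 4, func(j Job) {
		fmt.Println(j.URL, len(j.Body))
	})
}
```

In a real deployment `in` would be fed by the AMQP connection-handling code, which is exactly the part that would also need reconnect logic, metrics, throttling, and the rest of the list above.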

Does a good framework exist for this kind of stuff?

Note that I am asking specifically about Go. I want to avoid Hadoop and the whole Java stack.

Edit: Added some points for clarity.


3 Answers


Because Go has CSP-style channels, I'd suggest that Go offers a unique opportunity to implement a simple, concise, yet fully general parallel-processing framework. It should be possible to do better than most existing frameworks, with far less code. Java and the JVM can have nothing like this.

It would require nothing more than implementing channels over a configurable TCP transport. This would consist of:

  • a writing-channel-end API, including some general specification of the intended server for the reading end
  • a reading-channel-end API, including listening-port configuration and support for select
  • marshalling/unmarshalling glue to transfer the data — probably encoding/gob

A successful acceptance test for such a framework would be that a program using channels could be split across multiple processors while keeping the same functional behaviour (even if the performance differs).

There are quite a few existing transport-layer networking projects in Go. Notable is ZeroMQ (0MQ) (gozmq, zmq2, zmq3).

Answered 2013-03-18T22:40:00.247

I understand you want to avoid Hadoop+Java, but rather than spend time developing your own framework, you may want to have a look at Cascading. It provides a layer of abstraction over the underlying MapReduce jobs.

It's best summarized on Wikipedia: it [Cascading] follows a "source-pipe-sink" paradigm, where data is captured from sources, follows reusable "pipes" that perform data analysis processes, and the results are stored in output files or "sinks". Pipes are created independently of the data they will process. Once tied to data sources and sinks, it is called a "flow". These flows can be grouped into a "cascade", and the process scheduler will ensure a given flow does not execute until all its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs.

You may also want to have a look at some of their examples: Log Parser, Log Analysis, TF-IDF (and especially this flow diagram).

Answered 2013-03-17T04:06:27.083

I guess you are looking for a message queue, such as beanstalkd, RabbitMQ, or ØMQ (pronounced zero-MQ). The essence of all of these tools is that they provide push/receive methods for FIFO (or non-FIFO) queues, and some even have pub/sub.

So, one component puts data into a queue and another one reads from it. This approach is very flexible for adding or removing components and for scaling each of them up or down.

Most of these tools already have libraries for Go (ØMQ is quite popular among gophers) and other languages, so your overhead code is very small. Just import a library and start receiving and pushing messages.

To decrease this overhead and avoid a dependency on any specific API, you can write a thin package which uses one of these message queue systems to provide very simple push/receive calls, and use this package in all of your tools.
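A sketch of what that thin package's surface might look like, with a hypothetical `Queue` interface and an in-memory backend standing in for beanstalkd/RabbitMQ/ØMQ; the interface and type names here are illustrative, not from any real library.

```go
package main

import "fmt"

// Queue is the hypothetical thin abstraction: components depend on
// these two calls only, and the concrete backend (beanstalkd,
// RabbitMQ, ZeroMQ, ...) hides behind it.
type Queue interface {
	Push(msg []byte) error
	Receive() ([]byte, error)
}

// memQueue is an in-memory stand-in backend, enough to develop and
// test components before wiring in a real broker.
type memQueue struct{ c chan []byte }

func newMemQueue(size int) *memQueue {
	return &memQueue{c: make(chan []byte, size)}
}

func (q *memQueue) Push(msg []byte) error { q.c <- msg; return nil }

func (q *memQueue) Receive() ([]byte, error) { return <-q.c, nil }

func main() {
	var q Queue = newMemQueue(8) // later: a RabbitMQ- or ØMQ-backed Queue
	q.Push([]byte("fetched page"))
	msg, _ := q.Receive()
	fmt.Println(string(msg))
}
```

Swapping brokers then means writing one new implementation of the interface, with no changes to the components themselves.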

Answered 2013-03-16T16:51:58.070