I am designing an application that requires a distributed set of processing workers that asynchronously consume and produce data in a specific flow. For example:
- Component A fetches pages.
- Component B analyzes pages from A.
- Component C stores analyzed bits and pieces from B.
There are obviously more than just three components involved.
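To make the flow concrete, here is a minimal single-process sketch in which Go channels stand in for whatever transport ends up connecting the components. The names `Page`, `Analysis`, `fetch`, `analyze` and `store` are hypothetical, the page body is faked instead of fetched, and in the real system each stage would be its own process rather than a goroutine.

```go
package main

import (
	"fmt"
	"strings"
)

// Page is a hypothetical unit of work passed from A to B.
type Page struct {
	URL  string
	Body string
}

// Analysis is a hypothetical result passed from B to C.
type Analysis struct {
	URL   string
	Words int
}

// fetch plays the role of component A. It emits one Page per URL and
// closes its output when done. The body is faked; a real fetcher would
// issue an HTTP request here.
func fetch(urls []string) <-chan Page {
	out := make(chan Page)
	go func() {
		defer close(out)
		for _, u := range urls {
			out <- Page{URL: u, Body: "hello from " + u}
		}
	}()
	return out
}

// analyze plays the role of component B. It counts whitespace-separated
// words in each page it receives.
func analyze(in <-chan Page) <-chan Analysis {
	out := make(chan Analysis)
	go func() {
		defer close(out)
		for p := range in {
			out <- Analysis{URL: p.URL, Words: len(strings.Fields(p.Body))}
		}
	}()
	return out
}

// store plays the role of component C. It drains its input into a map
// that stands in for real storage.
func store(in <-chan Analysis) map[string]int {
	db := make(map[string]int)
	for a := range in {
		db[a.URL] = a.Words
	}
	return db
}

func main() {
	db := store(analyze(fetch([]string{"a.example", "b.example"})))
	fmt.Println(len(db), db["a.example"])
}
```

Each stage only sees the channel it reads from and the channel it writes to, which is the property I want to keep once the stages are separate processes.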
Further requirements:
- Each component needs to be a separate process (or set of processes).
- Producers don't know anything about their consumers. In other words, component A just produces data, not knowing which components consume that data.
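The second requirement can be sketched with a tiny in-memory topic broker. This is only an illustration of the decoupling, not a proposal for the real transport: `Broker`, `NewBroker`, `Subscribe` and `Publish` are hypothetical names, and a real broker would live in its own process.

```go
package main

import (
	"fmt"
	"sync"
)

// Broker is a hypothetical in-memory stand-in for a message broker.
type Broker struct {
	mu   sync.RWMutex
	subs map[string][]chan []byte
}

// NewBroker returns an empty broker.
func NewBroker() *Broker {
	return &Broker{subs: make(map[string][]chan []byte)}
}

// Subscribe registers a consumer on a topic and returns a buffered
// channel that receives every message published to it afterwards.
func (b *Broker) Subscribe(topic string, buf int) <-chan []byte {
	ch := make(chan []byte, buf)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	return ch
}

// Publish delivers msg to all current subscribers of topic and returns
// how many there were. The producer never learns who they are. The send
// blocks if a subscriber's buffer is full, and all subscribers share the
// same underlying byte slice.
func (b *Broker) Publish(topic string, msg []byte) int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[topic] {
		ch <- msg
	}
	return len(b.subs[topic])
}

func main() {
	b := NewBroker()
	analyzer := b.Subscribe("pages", 1)
	archiver := b.Subscribe("pages", 1)

	// Component A only knows the topic name, not its consumers.
	n := b.Publish("pages", []byte("<html>...</html>"))

	fmt.Println(n, string(<-analyzer), string(<-archiver))
}
```

Adding a third consumer of `pages` requires no change to the producer, which is the behaviour I am after.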
This is a kind of data flow solved by topology-oriented systems like Storm. While Storm looks good, I'm skeptical; it's a Java system and it's based on Thrift, neither of which I am a fan of.
I am currently leaning towards a pub/sub-style approach that uses AMQP as the data transport, with HTTP as the protocol for data sharing/storage. This means the AMQP queue model becomes a public API (in other words, a consumer needs to know which AMQP host and queue the producer uses), which I'm not particularly happy about, but it might be worth the compromise.
Another issue with the AMQP approach is that each component will have to have very similar logic for:
- Connecting to the queue
- Handling connection errors
- Serializing/deserializing data into a common format
- Running the actual workers (goroutines or forking subprocesses)
- Dynamic scaling of workers
- Fault tolerance
- Node registration
- Processing metrics
- Queue throttling
- Queue prioritization (some workers are less important than others)
…and many other little details that each component will need.
Even if a consumer is logically very simple (think MapReduce jobs, something like splitting text into tokens), there is a lot of boilerplate. Certainly I can do all this myself — I am very familiar with AMQP and queues and everything else — and wrap all this up in a common package shared by all the components, but then I am already on my way to inventing a framework.
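For illustration, this is roughly the shape such a shared package might take, reduced to three of the concerns listed above (running workers, serialization and processing metrics). Everything here is an assumption: `Handler`, `Stats`, `Run` and `tokenize` are hypothetical names, plain channels stand in for the queue, and JSON stands in for the common format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"sync"
	"sync/atomic"
)

// Handler is the only piece a component author would write: it turns
// one raw input message into one raw output message.
type Handler func(msg []byte) ([]byte, error)

// Stats holds minimal processing metrics, updated atomically.
type Stats struct {
	Processed int64
	Failed    int64
}

// Run starts n workers that read from in, call h, and send successful
// results to out. Failed messages are counted and dropped. Run returns
// after in is closed and all workers have finished, and closes out.
func Run(n int, in <-chan []byte, out chan<- []byte, h Handler, s *Stats) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range in {
				res, err := h(msg)
				if err != nil {
					atomic.AddInt64(&s.Failed, 1)
					continue
				}
				atomic.AddInt64(&s.Processed, 1)
				out <- res
			}
		}()
	}
	wg.Wait()
	close(out)
}

// tokenize is an example Handler: it decodes {"text": "..."} and
// encodes the whitespace-separated tokens as a JSON array.
func tokenize(msg []byte) ([]byte, error) {
	var in struct {
		Text string `json:"text"`
	}
	if err := json.Unmarshal(msg, &in); err != nil {
		return nil, err
	}
	return json.Marshal(strings.Fields(in.Text))
}

func main() {
	in := make(chan []byte, 2)
	out := make(chan []byte, 2)
	in <- []byte(`{"text":"split me up"}`)
	in <- []byte(`not json`)
	close(in)

	var s Stats
	Run(2, in, out, tokenize, &s)
	for res := range out {
		fmt.Println(string(res))
	}
	fmt.Println(s.Processed, s.Failed)
}
```

Even this toy version leaves out connection handling, scaling, registration, throttling and prioritization, which is exactly the part I would rather not write myself.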
Does a good framework exist for this kind of stuff?
Note that I am asking specifically about Go. I want to avoid Hadoop and the whole Java stack.
Edit: Added some points for clarity.