java - Workload distribution / parallel execution in Java

Question

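I have a situation where I need to distribute work across multiple Java processes running in different JVMs, possibly on different machines.

Say I have a table with records 1 to 1000. I want the work to be collected and handed out in sets of 10: records 1-10 go to workerOne, records 11-20 go to workerThree, and so on. Needless to say, workerOne never does workerTwo's work unless and until workerTwo is unable to do it.

This example is purely database-based, but I believe it extends to any system, be it file processing, email processing, and so on.

My gut feeling is that the immediate answer is a Master/Worker approach. However, we are talking about different JVMs here. Even if one JVM goes down, the others should keep working.

Now the million-dollar question: is there a good, production-ready framework that would let me do this? Concrete implementations for specific needs such as database records, file processing, or email processing would also be welcome.

I have looked at the Java Parallel Execution Framework, but I am not sure whether it can be used across different JVMs, or whether the others would keep running if one of them went down. I believe the workers can be on multiple JVMs, but what about the master?

More info 1: Hadoop would be a problem because of its JDK 1.6 requirement. That's a bit too much.

Thanks, Franklin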

score 2 · Accepted Answer

You might want to look into MapReduce and Hadoop.

Answered 2009-06-24T17:43:23.027

score 1 · Accepted Answer

Check out Hadoop.

Answered 2009-06-24T17:43:23.200

score 1 · Accepted Answer

You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. It's simple, and people have been doing it that way for a long time, so there's a lot of information about it on the net.
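To make the idea concrete, here is a rough sketch of the producer/worker shape using an in-process `BlockingQueue` as a stand-in. Across separate JVMs you would put a real message broker in the middle instead; the class and method names (`QueueWorkers`, `Chunk`, `makeChunks`, `run`) are made up for illustration, and the "processing" is just counting records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: threads stand in for worker JVMs, and the in-memory
// queue stands in for a message broker.
final class QueueWorkers {

    /** A chunk of work: an inclusive range of record ids. */
    static final class Chunk {
        final int first;
        final int last;
        Chunk(int first, int last) { this.first = first; this.last = last; }
    }

    /** Producer side: split records 1..total into chunks of chunkSize. */
    static List<Chunk> makeChunks(int total, int chunkSize) {
        List<Chunk> chunks = new ArrayList<>();
        for (int start = 1; start <= total; start += chunkSize) {
            chunks.add(new Chunk(start, Math.min(start + chunkSize - 1, total)));
        }
        return chunks;
    }

    /** Starts workerCount workers on one queue; returns how many records were processed. */
    static int run(int total, int chunkSize, int workerCount) {
        BlockingQueue<Chunk> queue = new LinkedBlockingQueue<>(makeChunks(total, chunkSize));
        AtomicInteger processed = new AtomicInteger();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(() -> {
                try {
                    Chunk c;
                    // Each worker pulls the next chunk; it stops once the queue stays empty.
                    while ((c = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
                        processed.addAndGet(c.last - c.first + 1); // placeholder for real work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 10, 3) + " records processed");
    }
}
```

Note that no worker is assigned a fixed range up front: whichever worker is free takes the next chunk, which is why the remaining workers naturally pick up the slack when one disappears. What this sketch does not cover is a worker dying after it has taken a chunk; a broker with acknowledgements or transactions is the usual way to get that chunk redelivered.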

score 1 · Accepted Answer

I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.

If you want to do this yourself, you will need a work manager that keeps track of jobs to do, jobs in progress, and jobs that were never finished and need to be rescheduled. The workers then ask for something to do, do it, send the result back, and ask for more.
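As a rough sketch of what such a work manager could look like (a hypothetical `WorkManager` class, not taken from any framework): jobs sit in a pending queue, move to an in-progress map with a lease deadline when a worker asks for one, and go back to pending if the lease runs out before a result arrives. The lease idea and all names are my assumptions; in a real setup the workers would call this over the network.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical work manager: tracks jobs to do, jobs in progress, and results.
final class WorkManager {
    private final Queue<Integer> pending = new ArrayDeque<>();
    private final Map<Integer, Long> inProgress = new HashMap<>(); // jobId -> lease deadline (ms)
    private final Map<Integer, String> results = new HashMap<>();
    private final long leaseMillis;

    WorkManager(List<Integer> jobIds, long leaseMillis) {
        pending.addAll(jobIds);
        this.leaseMillis = leaseMillis;
    }

    /** A worker asks for something to do; returns null if nothing is pending right now. */
    synchronized Integer requestJob(long nowMillis) {
        requeueExpired(nowMillis);
        Integer job = pending.poll();
        if (job != null) {
            inProgress.put(job, nowMillis + leaseMillis);
        }
        return job;
    }

    /** A worker sends a result back; a second result for the same job is ignored. */
    synchronized void complete(int jobId, String result) {
        if (results.containsKey(jobId)) {
            return;
        }
        inProgress.remove(jobId);
        pending.remove(Integer.valueOf(jobId)); // in case it was already rescheduled
        results.put(jobId, result);
    }

    synchronized boolean allDone() {
        return pending.isEmpty() && inProgress.isEmpty();
    }

    synchronized int completedCount() {
        return results.size();
    }

    /** Jobs whose lease has run out are assumed lost and go back to pending. */
    private void requeueExpired(long nowMillis) {
        Iterator<Map.Entry<Integer, Long>> it = inProgress.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> e = it.next();
            if (e.getValue() <= nowMillis) {
                pending.add(e.getKey());
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        WorkManager m = new WorkManager(Arrays.asList(1, 2, 3), 1000);
        m.requestJob(0);                      // worker A takes job 1, then crashes
        Integer b = m.requestJob(0);          // worker B takes job 2
        m.complete(b, "ok");
        Integer c = m.requestJob(500);        // worker B takes job 3
        m.complete(c, "ok");
        Integer retry = m.requestJob(2000);   // job 1's lease has expired, so it is handed out again
        m.complete(retry, "ok");
        System.out.println("all done: " + m.allDone());
    }
}
```

The current time is passed in explicitly only to keep the sketch deterministic; a real manager would read the clock itself and would also need to persist its state so that the manager is not a single point of failure.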

You may want to elaborate on what kind of work you want to do.

score 1 · Accepted Answer

The problem you've described is definitely best solved using the master/worker pattern.

You should have a look at JavaSpaces (part of the Jini framework); it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necessary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble when done.

Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.

score 0 · Accepted Answer

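If you are processing records from a single database, consider using stored procedures to do the work inside the database itself. The gain from processing records on different machines may be offset by the cost of retrieving the work and transferring it between the database and the compute nodes.

For file processing the situation may be similar. Processing files on a (shared) file system can put a lot of I/O pressure on the operating system.

The cost of maintaining multiple JVMs on multiple machines may also be overkill.

As for the question itself: I once used JADE (Java Agent Development Environment) for some distributed simulation. Its multi-machine support and message-passing nature might help you.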

score 0 · Accepted Answer

I would consider using JGroups for that. You can cluster your JVMs, and one of your nodes can be selected as master and then distribute the work to the other nodes by sending messages over the network. Or you can partition your work items up front and have the master node manage the distribution of the partitions: partition-1 goes to JVM-4, partition-2 goes to JVM-3, partition-3 goes to JVM-2, and so on. If JVM-4 goes down, the master node will notice and tell one of the other nodes to pick up partition-1 as well.

Another alternative, which is easier to use, is Redis pub/sub support: http://redis.io/topics/pubsub . But then you will have to maintain Redis servers, which I don't like.
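The master-side bookkeeping for that partition scheme could look roughly like the sketch below. It deliberately contains no JGroups calls: `PartitionAssigner` and its methods are hypothetical names, and I'm assuming the master gets told about a departed node by whatever membership notification the clustering layer provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical master-side table: which node currently owns which partition.
final class PartitionAssigner {
    private final Map<Integer, String> ownerByPartition = new TreeMap<>();
    private final List<String> liveNodes;

    PartitionAssigner(int partitionCount, List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("need at least one node");
        }
        liveNodes = new ArrayList<>(nodes);
        // Initial assignment: spread partitions 1..partitionCount round-robin over the nodes.
        for (int p = 1; p <= partitionCount; p++) {
            ownerByPartition.put(p, liveNodes.get((p - 1) % liveNodes.size()));
        }
    }

    /** Called when the master learns that a node is gone; survivors take over its partitions. */
    void nodeDown(String node) {
        if (!liveNodes.remove(node)) {
            return; // unknown or already removed
        }
        if (liveNodes.isEmpty()) {
            throw new IllegalStateException("no live nodes left");
        }
        int next = 0;
        for (Map.Entry<Integer, String> e : ownerByPartition.entrySet()) {
            if (e.getValue().equals(node)) {
                e.setValue(liveNodes.get(next++ % liveNodes.size()));
            }
        }
    }

    String ownerOf(int partition) {
        return ownerByPartition.get(partition);
    }

    /** Partitions currently owned by the given node, in ascending order. */
    List<Integer> partitionsOf(String node) {
        List<Integer> owned = new ArrayList<>();
        for (Map.Entry<Integer, String> e : ownerByPartition.entrySet()) {
            if (e.getValue().equals(node)) {
                owned.add(e.getKey());
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        PartitionAssigner a =
                new PartitionAssigner(8, Arrays.asList("JVM-1", "JVM-2", "JVM-3", "JVM-4"));
        a.nodeDown("JVM-4");
        System.out.println("JVM-1 now owns " + a.partitionsOf("JVM-1"));
    }
}
```

After a reassignment the master would still have to send each affected node a message telling it which extra partition to pick up, and if the master itself goes down a new one has to be chosen and the table rebuilt; neither is shown here.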
