akka.net - Node doesn't rejoin cluster after being downed

Question

I'm using Akka.NET's cluster (1.0.5) functionality to implement a service that consists of a master node that receives requests over HTTP and farms the work out to worker nodes that have joined the cluster.

The idea is to be able to easily accomplish the following:

add worker nodes to cluster when demand is high (check)
be able to reboot the master node or take it offline (maintentance/misbehaviour/whatever) and have the workers reconnect when it becomes available (check)
upgrade/reboot a misbehaving worker and have it reconnect to the master node (fail!)

The first point works as you'd expect: a new instance (Azure Cloud Service worker role) is spun up, and joins the master - which is also the seed node.

For the second point, all worker nodes have an actor that listens to cluster gossip and it determines if the master node has died. If this is the case, the worker node actor system will be rebooted.

The last point is where I'm stuck. The master node also listens to cluster gossip to determine when a worker has become unreachable (ClusterEvent.UnreachableMember) or is shutting down (Exiting status) and decides if it should be downed. According to what I've understood from documentation, the only way to have a "new" version of the same node rejoin the cluster is to down the old version first.

Unfortunately this doesn't seem to be happening. In the test scenario I ran to reproduce the problem locally in the compute emulator, these were the steps:

Start the master node (port 8090)
Start the worker node (port 9090)
Do some work
Kill the worker node abruptly
Start the worker node back up

Below are relevant snippets from the logs I collected for both nodes during this test:

Master:

Worker becomes unreachable:

[WARNING][07/12/2015 20:39:35][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://InventoryService@0.0.0.0:9090, status = Up]

Master node calls Cluster.Leave() and Cluster.Down() on the worker's address:

[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Leave
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marked address [akka.tcp://InventoryService@0.0.0.0:9090] as Leaving]
[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Down
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marking unreachable node [akka.tcp://InventoryService@0.0.0.0:9090] as Down
[DEBUG][07/12/2015 20:39:35][Thread 0020][[akka://InventoryService/system/cluster/core/daemon/heartbeatSender]] Cluster Node [akka.tcp://InventoryService@127.0.0.1:8090] - Heartbeat to [akka.tcp://InventoryService@0.0.0.0:9090]
[INFO][07/12/2015 20:39:36][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Leader is removing unreachable node [akka.tcp://InventoryService@0.0.0.0:9090]

Master confirms the old node will no longer be allowed to join (seems to have a bug though, see the first line - gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms, which I imagine would be the time it is supposed to be gated):

[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] with unknown UID is reported as quarantined, but address cannot be quarantined without knowing the UID, gated instead for akka.tcp://InventoryService@0.0.0.0:9090 ms
[DEBUG][07/12/2015 20:39:36][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%400.0.0.0%3a9090-2/endpointWriter]] Disassociated [akka.tcp://InventoryService@127.0.0.1:8090] -> akka.tcp://InventoryService@0.0.0.0:9090
[DEBUG][07/12/2015 20:39:36][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService@0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

Worker boots and tries to connect to the master:

[DEBUG][07/12/2015 20:40:20][Thread 0013][remoting] Associated [akka.tcp://InventoryService@127.0.0.1:8090] <- akka.tcp://InventoryService@0.0.0.0:9090
[DEBUG][07/12/2015 20:40:21][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:21][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:23][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:28][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:33][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:38][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:43][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:48][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:53][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:58][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:03][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:08][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:13][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:18][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin

What is happening here?

Worker:

Booting back up after being killed:

[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:20][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:21][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%40127.0.0.1%3a8090-1/endpointWriter]] Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,smallBackoffCount: 0, noBackoffCount: 0,adaptiveBackoff: 10000

And thats it...nothing else gets written to the log!

Full log files:

Master: http://pastebin.com/raw.php?i=WtjEhV1V
Worker: http://pastebin.com/raw.php?i=QGPxkqEd

Master cluster config:

cluster {
    seed-nodes = ["master's address here"]
    roles = [ InventoryServiceMaster, InventoryServiceWorker ]
    failure-detector {
        acceptable-heartbeat-pause = 5s
        threshold = 10.0
    }
}

Worker's config is the same, but only has the InventoryServiceWorker role.

What am I missing here? Is this a configuration problem? (I'm hoping its not a bug - I've seen someone else report a similar problem on Github).

EDIT:

Just to be clear, I'm not using the Akka.dll from Nuget since it contains a serialization bug - I checked ou the current master applied the fix and did a Release build. The logs have debug information because I kept the PDB from the build.

EDIT 2:

In the worker log, after rebooting, the event Akka.Cluster.InternalClusterAction+JoinSeedNodes appears twice because I originally had a manual call to Cluster.JoinSeedNodes(). I've since removed this but the result is still the same.

score 1 · Accepted Answer

这已在 Akka.NET 1.1 中得到解决 - 我们的 UID 系统在此之前没有正确实现（在本文发布时为 1.0.5），但现在可以正常工作。

akka.net - Node doesn't rejoin cluster after being downed

1 回答 1

Related

Reference