Suppose I have a 3-node EKS cluster made up of 3 Spot Instances (we'll call them Node A, B, and C), and each node has critical pods scheduled on it. The cluster is running the AWS Node Termination Handler. A Spot interruption notice appears in the instance metadata saying that Node A will be reclaimed by Amazon in 2 minutes.
The Node Termination Handler cordons and drains the node being reclaimed (Node A), and a new node spins up. The pods from Node A are then scheduled on the Node A replacement. If all of this completes within the two-minute window, perfect.
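For reference, the notice I'm talking about is the one the instance metadata service exposes at `spot/instance-action`. Below is a minimal Python sketch of how much time such a notice leaves, assuming the documented JSON shape (`action` and a UTC `time`). The sample body and the `seconds_until_reclaim` helper are made up for illustration and are not part of the Node Termination Handler.

```python
import json
from datetime import datetime, timezone

# Instance metadata path for the Spot interruption notice.
# It returns 404 until a notice has actually been issued.
INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_reclaim(notice_body: str, now: datetime) -> float:
    """Return the seconds left between `now` and the reclaim time in the notice.

    Hypothetical helper: it only parses the JSON body, it does not call the
    metadata service itself.
    """
    notice = json.loads(notice_body)
    reclaim_at = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (reclaim_at - now).total_seconds()


# Illustrative notice body, not captured from a real instance.
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
print(seconds_until_reclaim(sample, now))  # 120.0
```

Everything in my scenario (cordon, drain, replacement node startup, pod rescheduling) has to fit inside that number.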
Is there a benefit to keeping spare capacity around (Node D)? If Node A is reclaimed by Amazon, will my pods be rescheduled onto Node D, since it is already available?
In this architecture, it seems like a good idea to keep a spare node or two around for pod rescheduling, so that I'm not depending on a replacement node coming up inside the 2-minute window. Do I need to do anything special to make sure the pods are rescheduled as efficiently as possible?