google-compute-engine - Docker seems to be killed before my shutdown script is executed when my VM is preempted on GCE

Question

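I want to implement a shutdown-script that is called when my VM is about to be preempted on Google Compute Engine. The VM runs Docker containers that execute long-running batch jobs, so I send them a signal to make them exit gracefully.

The shutdown script works well when I run it manually, but it breaks on a real preemption or when I kill the VM myself.

I got this error: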

... logs from my containers ...

A 2019-08-13T16:54:07.943153098Z time="2019-08-13T16:54:07Z" level=error msg="error waiting for container: unexpected EOF" 

(Just after this error, I can see the output of the first line of my shutdown script; see the code below.)

A 2019-08-13T16:54:08.093815210Z 2019-08-13 16:54:08: Shutting down!  TEST SIGTERM SHUTTING DOWN (this is the first line of my shutdown script)
A 2019-08-13T16:54:08.093845375Z docker ps -a 
(no result)
A 2019-08-13T16:54:08.155512145Z ps -ef 
... a lot of things, but nothing related to docker ...

2019-08-13 16:54:08: task_subscriber not running, shutting down immediately.

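I use a preemptible VM from GCE, with the image Container-Optimized OS 73-11647.267.0 stable. I run my Docker containers as services with systemctl, but I did not think this was related. [EDIT] Actually, this is what let me solve my problem.

Now, I am pretty sure that a lot of things happen when Google sends the ACPI signal to my VM, even before the shutdown script is fetched from the VM metadata and called.

My guess is that all services are stopped at the same time, including docker.service.

While my container is running, I can reproduce the same level=error msg="error waiting for container: unexpected EOF" with a simple sudo systemctl stop docker.service.

Here is part of my shutdown script: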


#!/bin/bash
# This script must be added in the VM metadata as "shutdown-script" so that
# it is executed when the instance is being preempted.


CONTAINER_NAME="task_subscriber" # Name of the container to wait for

logTime() {
    local datetime="$(date +"%Y-%m-%d %T")"
    echo -e "$datetime: $1" # Console
    echo -e "$datetime: $1" >>/var/log/containers/$CONTAINER_NAME.log
}



logTime "Shutting down!  TEST SIGTERM SHUTTING DOWN"

echo "docker ps -a" >>/var/log/containers/$CONTAINER_NAME.log
docker ps -a >>/var/log/containers/$CONTAINER_NAME.log

echo "ps -ef" >>/var/log/containers/$CONTAINER_NAME.log
ps -ef >>/var/log/containers/$CONTAINER_NAME.log

if [[ ! "$(docker ps -q -f name=${CONTAINER_NAME})" ]]; then
    logTime "${CONTAINER_NAME} not running, shutting down immediately."
    sleep 10 # Give time to send logs
    exit 0
fi

logTime "Sending SIGTERM to ${CONTAINER_NAME}"
#docker kill --signal=SIGTERM ${CONTAINER_NAME}
systemctl stop taskexecutor.service

# Portable waitpid equivalent
while [[ "$(docker ps -q -f name=${CONTAINER_NAME})" ]]; do
    sleep 1
    logTime "Waiting for ${CONTAINER_NAME} termination"
done

logTime "${CONTAINER_NAME} is done, shutting down."
logTime "TEST SIGTERM SHUTTING DOWN BYE BYE"
sleep 10 # Give time to send logs

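If I just call systemctl stop taskexecutor.service manually (rather than by actually shutting down the server), the SIGTERM signal is sent to my container, and my app handles it properly and exits.

Any ideas?

-- How I solved my problem --

I could solve it by adding a dependency on docker in my service configuration: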

[Unit]
Wants=gcr-online.target docker.service
After=gcr-online.target docker.service
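
For context on why this ordering helps: systemd stops units in the reverse of their start order, so a unit declared After=docker.service is stopped before docker.service at shutdown. Below is a minimal sketch of what a complete unit might look like, assuming the unit being fixed is taskexecutor.service. The file path, the image name example/task-subscriber, the docker binary location, and the timeout values are all illustrative assumptions, not taken from the original configuration.

```ini
# /etc/systemd/system/taskexecutor.service (hypothetical path and contents)
[Unit]
Description=Task executor container (illustrative)
Wants=gcr-online.target docker.service
# After= also means "stopped before docker.service" during shutdown.
After=gcr-online.target docker.service

[Service]
# Hypothetical image name; replace with the real one.
ExecStart=/usr/bin/docker run --rm --name=task_subscriber example/task-subscriber
# docker stop sends SIGTERM, then SIGKILL once the --time grace period ends.
ExecStop=/usr/bin/docker stop --time=20 task_subscriber
# Give systemd slightly longer than the docker stop grace period (assumed values).
TimeoutStopSec=25

[Install]
WantedBy=multi-user.target
```

The grace periods here are chosen only to stay under a 30-second budget; they would need tuning to how long the batch job actually takes to exit.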

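I don't know how the magic works by which Google executes the shutdown-script stored in the metadata. But I think they should fix something in Container-Optimized OS so that this magic happens before docker is stopped. Otherwise, we cannot rely on it to shut down gracefully with a basic script (fortunately I was using systemd here)...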

score 0 · Accepted Answer

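According to the documentation [1], using a shutdown script on preemptible VM instances is feasible. However, there are some limitations: Compute Engine executes shutdown scripts only on a best-effort basis, and in rare cases it cannot guarantee that the shutdown script will complete. I would also like to mention that preemptible instances have 30 seconds after instance preemption begins [2], which might be killing Docker before the shutdown script is executed. Judging from the error message in your use case, this seems to be expected behavior for Docker when it has been running for a long duration [3].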

[1] https://cloud.google.com/compute/docs/instances/create-start-preemptible-instance#handle_preemption
[2] https://cloud.google.com/compute/docs/shutdownscript#limitations
[3] https://github.com/docker/for-mac/issues/1941
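
Given the 30-second window mentioned in the answer, the unbounded wait loop in the question's script could be given a deadline so the script still gets to log and exit before the VM goes away. The following is a rough sketch only: wait_until_stopped is a hypothetical helper name, and the 25-second budget in the usage comment is an assumed value, not something from the original post.

```shell
#!/bin/bash
# Sketch only: wait_until_stopped is a hypothetical helper, not part of the
# original shutdown script.
#
# Usage: wait_until_stopped <timeout_seconds> <command...>
# Re-runs <command> about once per second while it prints non-empty output.
# Returns 0 once the output is empty, 1 if the timeout elapses first.
wait_until_stopped() {
    local timeout="$1"
    shift
    # SECONDS is a bash builtin counting seconds since the shell started.
    local deadline=$(( SECONDS + timeout ))
    while [[ -n "$("$@")" ]]; do
        if (( SECONDS >= deadline )); then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Possible call from the shutdown script (assumed 25-second budget, leaving
# some headroom inside the 30-second preemption window):
#
#   wait_until_stopped 25 docker ps -q -f "name=${CONTAINER_NAME}" \
#       || echo "Timed out waiting for ${CONTAINER_NAME}"
```

Passing the check as a command keeps the helper independent of docker, so the same function works for any condition that prints output while something is still running.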
