azure - RabbitMQ fails to start after upgrading Azure Kubernetes Service (AKS)

Question

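I have the same problem as @Amir Soleimani, but the error output is slightly different. I tried every solution in that post and none of them worked. I am using Azure Kubernetes Service (AKS), and after upgrading from 1.13.xx to 1.18.xx RabbitMQ can no longer start.

Update - the solution that worked for me (consider this approach carefully, because it may affect your existing queues):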

Remove current rabbitmq StatefulSet including persistent disks
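
A minimal sketch of that cleanup, assuming the names from the manifest below (StatefulSet rabbitmq, claim template volume, 3 replicas). The pvc_names helper is hypothetical. It relies on the Kubernetes convention that a StatefulSet's claims are named <claim-template>-<statefulset>-<ordinal>. The script only prints the kubectl commands so they can be reviewed first; deleting the claims discards whatever queue data is stored on them.

```shell
#!/bin/sh
# Hypothetical helper: list the PVC names a StatefulSet creates.
# Convention: <claim-template>-<statefulset>-<ordinal>
pvc_names() {
  claim="$1"; sts="$2"; replicas="$3"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${claim}-${sts}-${i}"
    i=$((i + 1))
  done
}

# Dry run: print the commands instead of executing them.
echo "kubectl delete statefulset rabbitmq"
for pvc in $(pvc_names volume rabbitmq 3); do
  echo "kubectl delete pvc $pvc"
done
```

Once the printed commands look right, they can be run by hand or piped to sh, after which the StatefulSet can be re-applied from the YAML file.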

========

Here is my StatefulSet file:

apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-management
  labels:
    app: rabbitmq
spec:
  ports:
    - port: 80
      targetPort: 15672
      name: http
  selector:
    app: rabbitmq
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  ports:
    - port: 5672
      name: amqp
    - port: 4369
      name: epmd
    - port: 25672
      name: rabbitmq-dist
  clusterIP: None
  selector:
    app: rabbitmq
---
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-config
  namespace: default
type: Opaque
data:
  erlang.cookie: samplecookie==
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  serviceName: rabbitmq
  selector:
    matchLabels:
      app: rabbitmq
  replicas: 3
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
        - name: rabbitmq
          image: 'rabbitmq:3.6.6-management-alpine'
          lifecycle:
            postStart:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - >
                    if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
                      sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
                      cat /etc/resolv.conf.new > /etc/resolv.conf;
                      rm /etc/resolv.conf.new;
                    fi;
                    until rabbitmqctl node_health_check; do sleep 1; done;
                    if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
                      rabbitmqctl stop_app;
                      rabbitmqctl join_cluster rabbit@rabbitmq-0;
                      rabbitmqctl start_app;
                    fi;
                    rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
          env:
            - name: RABBITMQ_ERLANG_COOKIE
              valueFrom:
                secretKeyRef:
                  name: rabbitmq-config
                  key: erlang.cookie
            - name: RABBITMQ_DEFAULT_USER
              value: username
            - name: RABBITMQ_DEFAULT_PASS
              value: password
          ports:
            - containerPort: 5672
              name: amqp
            - containerPort: 15672
              name: amqp-management
          volumeMounts:
            - mountPath: /var/lib/rabbitmq
              name: volume
  volumeClaimTemplates:
    - metadata:
        name: volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
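
For readers unfamiliar with the first part of the postStart hook, here is a small sketch of what its sed expression does to /etc/resolv.conf. The sample search line is an assumption (a typical value for a pod in the default namespace), and prepend_service_domain is a hypothetical wrapper around the same expression. Note that \+ is a GNU/BusyBox sed extension.

```shell
#!/bin/sh
# Same substitution as in the postStart hook: put "rabbitmq.<first search domain>"
# in front of the existing search list.
prepend_service_domain() {
  sed 's/^search \([^ ]\+\)/search rabbitmq.\1 \1/'
}

# Assumed sample line; the real one depends on the cluster and namespace.
sample='search default.svc.cluster.local svc.cluster.local cluster.local'
printf '%s\n' "$sample" | prepend_service_domain
```

With the headless rabbitmq Service's domain first in the search list, a short name such as rabbitmq-0 can resolve to the per-pod record rabbitmq-0.rabbitmq.default.svc.cluster.local, which is what the join_cluster rabbit@rabbitmq-0 call relies on.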

Output of kubectl describe pod rabbitmq-0:

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0']

rabbit@rabbitmq-0:
  * connected to epmd (port 4369) on rabbitmq-0
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on rabbitmq-0
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-91@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==

Error: unable to connect to node 'rabbit@rabbitmq-0': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0']

rabbit@rabbitmq-0:
  * connected to epmd (port 4369) on rabbitmq-0
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on rabbitmq-0
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-26@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==

Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: rabbit application is not running on node rabbit@rabbitmq-0.
 * Suggestion: start it with "rabbitmqctl start_app" and try again
, message: "Timeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nError: unable to connect to node 'rabbit@rabbitmq-0': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@rabbitmq-0']\n\nrabbit@rabbitmq-0:\n  * connected to epmd (port 4369) on rabbitmq-0\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on rabbitmq-0\n  * suggestion: start the node\n\ncurrent node details:\n- node name: 'rabbitmq-cli-91@rabbitmq-0'\n- home dir: /var/lib/rabbitmq\n- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\nError: unable to connect to node 'rabbit@rabbitmq-0': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@rabbitmq-0']\n\nrabbit@rabbitmq-0:\n  * connected to epmd (port 4369) on rabbitmq-0\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on rabbitmq-0\n  * suggestion: start the node\n\ncurrent node details:\n- node name: 'rabbitmq-cli-26@rabbitmq-0'\n- home dir: /var/lib/rabbitmq\n- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: 
{aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: rabbit application is not running on node rabbit@rabbitmq-0.\n * Suggestion: start it with \"rabbitmqctl start_app\" and try again\n"
  Warning  FailedPostStartHook  23m  kubelet  Exec lifecycle hook ([/bin/sh -c if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
  cat /etc/resolv.conf.new > /etc/resolv.conf;
  rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl join_cluster rabbit@rabbitmq-0;
  rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_default(3ac91d73-de7b-4cde-81f6-c31bacd10252)" failed - error: command '/bin/sh -c if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
  cat /etc/resolv.conf.new > /etc/resolv.conf;
  rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl join_cluster rabbit@rabbitmq-0;
  rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to connect to node 'rabbit@rabbitmq-0': nodedown

Output of kubectl logs rabbitmq-0:

=CRASH REPORT==== 18-Jul-2021::11:06:01 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.156.0>
    registered_name: []
    exception exit: {{timeout_waiting_for_tables,
                         [rabbit_user,rabbit_user_permission,rabbit_vhost,
                          rabbit_durable_route,rabbit_durable_exchange,
                          rabbit_runtime_parameters,rabbit_durable_queue]},
                     {rabbit,start,[normal,[]]}}
      in function  application_master:init/4 (application_master.erl, line 134)
    ancestors: [<0.155.0>]
    messages: [{'EXIT',<0.157.0>,normal}]
    links: [<0.155.0>,<0.31.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 98
  neighbours:

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: rabbit
    exited: {{timeout_waiting_for_tables,
                 [rabbit_user,rabbit_user_permission,rabbit_vhost,
                  rabbit_durable_route,rabbit_durable_exchange,
                  rabbit_runtime_parameters,rabbit_durable_queue]},
             {rabbit,start,[normal,[]]}}
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: amqp_client
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: rabbit_common
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: xmerl
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: os_mon
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: inets
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: asn1
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: syntax_tools
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: mnesia
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: crypto
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: ranch
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: compiler
    exited: stopped
    type: temporary


BOOT FAILED
===========

Timeout contacting cluster nodes: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2'].

BACKGROUND
==========

This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2']

rabbit@rabbitmq-1:
  * unable to connect to epmd (port 4369) on rabbitmq-1: nxdomain (non-existing domain)

rabbit@rabbitmq-2:
  * unable to connect to epmd (port 4369) on rabbitmq-2: nxdomain (non-existing domain)


current node details:
- node name: 'rabbit@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==



=INFO REPORT==== 18-Jul-2021::11:06:01 ===
Timeout contacting cluster nodes: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2'].

BACKGROUND
==========

This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2']

rabbit@rabbitmq-1:
  * unable to connect to epmd (port 4369) on rabbitmq-1: nxdomain (non-existing domain)

rabbit@rabbitmq-2:
  * unable to connect to epmd (port 4369) on rabbitmq-2: nxdomain (non-existing domain)


current node details:
- node name: 'rabbit@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==


{"init terminating in do_boot",timeout_waiting_for_tables}
init terminating in do_boot (timeout_waiting_for_tables)

Crash dump is being written to: erl_crash.dump...
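
The nxdomain lines in the log suggest that rabbitmq-0 is waiting for peers whose hostnames do not resolve at all, which would be consistent with those pods not having been created yet. As a rough diagnostic sketch, the hypothetical unresolved_peers function below pulls those hostnames out of a saved log so they can be compared with the pods that actually exist.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print each host for which
# the epmd lookup failed with nxdomain (sorted, without duplicates).
unresolved_peers() {
  sed -n 's/.*unable to connect to epmd (port [0-9]*) on \([^:]*\): nxdomain.*/\1/p' | sort -u
}

# Demo on two lines copied from the log above.
unresolved_peers <<'EOF'
  * unable to connect to epmd (port 4369) on rabbitmq-1: nxdomain (non-existing domain)
  * unable to connect to epmd (port 4369) on rabbitmq-2: nxdomain (non-existing domain)
EOF
```

In practice the input would come from kubectl logs rabbitmq-0, and the result could be checked against kubectl get pods -l app=rabbitmq.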

What I tried, without success:

rabbitmqctl stop_app
rabbitmqctl force_boot
Remove StatefulSet and re-install
Re-configure the yaml file

score 1 · Accepted Answer

Please try forcing a boot in the start script:

...

fi;

if [[ "$HOSTNAME" == "rabbitmq-0" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl force_boot;
fi;

until rabbitmqctl node_health_check; do sleep 1; done; ...
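
To try the guard outside the pod first, it can be wrapped in a function with a dry-run switch. This is only a sketch: the RUN variable and the force_boot_first_node name are hypothetical, and whether stop_app succeeds at that point depends on how far the node has got in its startup.

```shell
#!/bin/sh
# RUN defaults to "echo" (dry run: print the commands).
# Export RUN="" before running the script to execute them for real.
RUN="${RUN-echo}"

# Force-boot only the first replica, as the answer suggests.
force_boot_first_node() {
  host="$1"
  if [ "$host" = "rabbitmq-0" ]; then
    $RUN rabbitmqctl stop_app
    $RUN rabbitmqctl force_boot
  fi
}

# Demo: prints the two rabbitmqctl commands for rabbitmq-0.
force_boot_first_node "rabbitmq-0"
```

As the BACKGROUND section of the log warns, force_boot may lose changes made on other cluster nodes after this one was shut down, so restricting it to rabbitmq-0 keeps the other replicas on the normal join path.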
