We are having problems configuring PostgreSQL for HA with Corosync and Pacemaker.
The crm_mon output is:
Last updated: Thu Dec 18 10:24:04 2014
Last change: Thu Dec 18 10:16:30 2014 via crmd on umhtvappdpj05.arqiva.local
Stack: corosync
Current DC: umhtvappdpj06.arqiva.local (1) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
4 Resources configured

Online: [ umhtvappdpj05.arqiva.local umhtvappdpj06.arqiva.local ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ umhtvappdpj06.arqiva.local ]
     Slaves: [ umhtvappdpj05.arqiva.local ]
 Resource Group: master-group
     vip-master    (ocf::heartbeat:IPaddr2):    Started umhtvappdpj06.arqiva.local
     vip-rep       (ocf::heartbeat:IPaddr2):    Started umhtvappdpj06.arqiva.local

Node Attributes:
* Node umhtvappdpj05.arqiva.local:
    + master-pgsql            : -INFINITY
    + pgsql-data-status       : LATEST
    + pgsql-status            : HS:alone
    + pgsql-xlog-loc          : 0000000097000168
* Node umhtvappdpj06.arqiva.local:
    + master-pgsql            : 1000
    + pgsql-data-status       : LATEST
    + pgsql-master-baseline   : 0000000094000090
    + pgsql-status            : PRI

Migration summary:
* Node umhtvappdpj05.arqiva.local:
* Node umhtvappdpj06.arqiva.local:
Here node 06 (umhtvappdpj06.arqiva.local) starts as the master and node 05 (umhtvappdpj05.arqiva.local) acts as the standby, but the two are not connected to each other (replication is never established).
recovery.conf on node 05:
standby_mode = 'on'
primary_conninfo = 'host=10.52.6.95 port=5432 user=postgres application_name=umhtvappdpj05.arqiva.local keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p'
recovery_target_timeline = 'latest'
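Since pgsql-status is stuck at HS:alone, a useful first check is whether the standby can reach the primary on the replication VIP at all, and whether the primary sees a walreceiver connection. A diagnostic sketch using the addresses and data directory from the config above (the log path is an assumption based on the default pg_log location under pgdata):

```shell
# On the standby (node 05): can we reach the primary on the replication VIP?
psql -h 10.52.6.95 -p 5432 -U postgres -c 'SELECT 1;'

# On the primary (node 06): does it see a connected walreceiver?
psql -U postgres -c 'SELECT application_name, state, sync_state FROM pg_stat_replication;'

# On the standby: is a wal receiver process running, and what do the logs say?
ps aux | grep '[w]al receiver'
tail -n 50 /pgdata/data/pg_log/postgresql-*.log   # assumed default log location
```

If the first psql call fails, the problem is connectivity to the vip-rep address (firewall, pg_hba.conf, or the VIP being bound on the wrong interface) rather than the Pacemaker configuration itself.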
The resources were created with:
pcs resource create vip-master IPaddr2 \
ip="10.52.6.94" \
nic="ens192" \
cidr_netmask="24" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="block"
pcs resource create vip-rep IPaddr2 \
ip="10.52.6.95" \
nic="ens192" \
cidr_netmask="24" \
meta migration-threshold="0" \
op start timeout="60s" interval="0s" on-fail="stop" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="ignore"
pcs resource create pgsql ocf:heartbeat:pgsql \
pgctl="/usr/pgsql-9.3/bin/pg_ctl" \
psql="/usr/pgsql-9.3/bin/psql" \
pgdata="/pgdata/data" \
rep_mode="sync" \
node_list="10.52.6.85 10.52.6.92" \
restore_command="scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p" \
master_ip="10.52.6.95" \
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
restart_on_promote='true' \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op monitor timeout="60s" interval="9s" on-fail="restart" role="Master" \
op promote timeout="60s" interval="0s" on-fail="restart" \
op demote timeout="60s" interval="0s" on-fail="stop" \
op stop timeout="60s" interval="0s" on-fail="block" \
op notify timeout="60s" interval="0s"
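The crm_mon output also shows the msPostgresql master/slave set and the master-group, whose creation commands are not listed above. For completeness, the usual companion steps with the pgsql resource agent look roughly like this; this is a sketch of the conventional setup, not necessarily the exact commands used here:

```shell
# Wrap pgsql in a master/slave (multi-state) resource
pcs resource master msPostgresql pgsql \
    master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

# Group the two VIPs and tie the group to the master role
pcs resource group add master-group vip-master vip-rep
pcs constraint colocation add master-group with master msPostgresql INFINITY
pcs constraint order promote msPostgresql then start master-group \
    symmetrical=false score=INFINITY
```

Note that node_list above uses IP addresses (10.52.6.85 10.52.6.92) while the cluster node names are FQDNs; it is worth verifying that the values match what `crm_node -n` reports on each node, since the pgsql agent compares them.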
[root@umhtvappdpj05 data]# pcs resource show --all
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ umhtvappdpj06.arqiva.local ]
     Slaves: [ umhtvappdpj05.arqiva.local ]
 Resource Group: master-group
     vip-master    (ocf::heartbeat:IPaddr2):    Started
     vip-rep       (ocf::heartbeat:IPaddr2):    Started
[root@umhtvappdpj05 data]#
The only anomaly is that Corosync and Pacemaker were first installed on node 06 while it was on a different subnet from node 05; node 06 was later moved onto the same subnet as node 05. Could that be the cause? Perhaps reinstalling on node 06 would help; that seems plausible.
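To see whether the earlier subnet move left stale Corosync state behind, the ring status and membership can be inspected on each node with the standard corosync 2.x tooling; a quick sketch:

```shell
# Ring status: local node ID and health of each ring on this node
corosync-cfgtool -s

# Current membership (node IDs and addresses) as corosync sees it
corosync-cmapctl | grep members

# Pacemaker's view of the nodes
pcs status nodes
crm_node -l
```

If the member addresses shown still reflect the old subnet, fixing the corosync configuration (and restarting the stack on node 06) would be a lighter remedy than a full reinstall.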
Thanks,
Samir