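A record of a switchover failure in a PostgreSQL repmgr high-availability cluster
Author: Internet
The output of the failed switchover is as follows: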
postgres@allsql02->repmgr standby switchover -f ~/repmgr.conf --siblings-follow
DEBUG: connecting to: "user=repmgr password=QAZwsx123_ connect_timeout=2 dbname=repmgr host=10.10.10.12 port=5432 fallback_application_name=repmgr"
NOTICE: executing switchover on node "allSql02" (ID: 238)
DEBUG: connecting to: "user=repmgr password=QAZwsx123_ connect_timeout=2 dbname=repmgr host=10.10.10.11 port=5432 fallback_application_name=repmgr"
DEBUG: remote_command(): ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.10.10.11 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr.conf --version >/dev/null 2>&1 && echo "1" || echo "0"
DEBUG: remote_command(): ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.10.10.11 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr.conf --version 2>/dev/null
DEBUG: "repmgr" version on "10.10.10.11" is 50100
DEBUG: remote_command(): ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.10.10.11 test -f /home/postgres/repmgr.conf && echo 1 || echo 0
DEBUG: remote_command(): ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.10.10.11 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr.conf node check --data-directory-config --optformat -LINFO 2>/dev/null
WARNING: option "--sibling-nodes" specified, but no sibling nodes exist
DEBUG: remote_command(): ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.10.10.11 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr.conf node check --remote-node-id=238 --replication-connection
DEBUG: connecting to: "user=repmgr password=QAZwsx123_ connect_timeout=2 dbname=repmgr host=10.10.10.11 port=5432 fallback_application_name=repmgr"
DEBUG: remote_command(): ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.10.10.11 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr.conf node check --terse -LERROR --archive-ready --optformat
DEBUG: lag is 0
DEBUG: connecting to: "user=repmgr password=QAZwsx123_ connect_timeout=2 dbname=repmgr host=10.10.10.11 port=5432 fallback_application_name=repmgr"
DEBUG: connecting to: "user=repmgr password=QAZwsx123_ connect_timeout=2 dbname=repmgr host=10.10.10.12 port=5432 fallback_application_name=repmgr"
NOTICE: local node "allSql02" (ID: 238) will be promoted to primary; current primary "allSql01" (ID: 237) will be demoted to standby
NOTICE: stopping current primary node "allSql01" (ID: 237)
DEBUG: remote_command(): ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.10.10.11 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr.conf node service --action=stop --checkpoint
DEBUG: connecting to: "user=repmgr password=QAZwsx123_ connect_timeout=2 dbname=repmgr host=10.10.10.11 port=5432 fallback_application_name=repmgr"
NOTICE: issuing CHECKPOINT on node "allSql01" (ID: 237)
DETAIL: executing server command "/usr/pgsql/bin/pg_ctl -D '/data/pgdata/11/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_REJECT
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_NO_RESPONSE
DEBUG: remote_command(): ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.10.10.11 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr.conf node status --is-shutdown-cleanly
NOTICE: current primary has been cleanly shut down at location 0/F000028
NOTICE: waiting up to 30 seconds (parameter "wal_receive_check_timeout") for received WAL to flush to disk
INFO: sleeping 1 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 2 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 3 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 4 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 5 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 6 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 7 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 8 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 9 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 10 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 11 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 12 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 13 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 14 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 15 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 16 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 17 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 18 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 19 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 20 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 21 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 22 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 23 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 24 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 25 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 26 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 27 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 28 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 29 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 30 of maximum 30 seconds waiting for standby to flush received WAL to disk
WARNING: local node "allSql02" is behind shutdown primary "allSql01"
DETAIL: local node last receive LSN is 0/E086000, primary shutdown checkpoint LSN is 0/F000028
NOTICE: aborting switchover
HINT: use --always-promote to force promotion of standby
The switchover did not succeed: node 1 was left shut down, and node 2 was not promoted to primary.
Node 1:
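The final WARNING and DETAIL lines of the output give the two positions that repmgr compared before aborting: the standby's last received LSN (0/E086000) and the primary's shutdown checkpoint LSN (0/F000028). An LSN is printed as two hexadecimal numbers, the high and low 32 bits of a byte offset in the WAL stream, so the gap can be worked out by hand. The following is only a minimal sketch in POSIX shell; the function names lsn_to_bytes and lsn_gap are made up for this illustration and are not part of repmgr or PostgreSQL.

```shell
# Hypothetical helper: convert an LSN of the form HI/LO
# (two hexadecimal numbers) into a byte offset.
lsn_to_bytes() {
    hi=${1%/*}   # part before the slash: high 32 bits
    lo=${1#*/}   # part after the slash: low 32 bits
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Hypothetical helper: number of bytes by which the first LSN
# is ahead of the second.
lsn_gap() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Values copied from the DETAIL line of the switchover output.
primary_shutdown_lsn=0/F000028
standby_receive_lsn=0/E086000
echo "standby is behind by $(lsn_gap "$primary_shutdown_lsn" "$standby_receive_lsn") bytes"
```

For the values in this log the gap is 16228392 bytes, slightly less than one WAL segment at the PostgreSQL default segment size of 16 MiB. On a live server the same comparison can be done in SQL with pg_wal_lsn_diff().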
postgres@allsql01->repmgr -f ~/repmgr.conf cluster show
DEBUG: connecting to: "user=repmgr password=QAZwsx123_ connect_timeout=2 dbname=repmgr host=10.10.10.11 port=5432 fallback_application_name=repmgr"
ERROR: connection to database failed
DETAIL: could not connect to server: Connection refused
    Is the server running on host "10.10.10.11" and accepting TCP/IP connections on port 5432?
DETAIL: attempted to connect using:
  user=repmgr password=QAZwsx123_ connect_timeout=2 dbname=repmgr host=10.10.10.11 port=5432 fallback_application_name=repmgr
postgres@allsql01->pg_ctl -D /data/pgdata/11/data -l logfile start
waiting for server to start.... done
server started
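After node 1 was started again by hand, it was still the primary: no switchover had taken place.
Parameter change:
Make the change on both the primary and the standby node:
shutdown_check_timeout = 100, then restart PostgreSQL.
This resolved the failure that ended with "HINT: use --always-promote to force promotion of standby".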
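The parameter change described above can be scripted so that it is applied identically on both nodes. What follows is a rough sketch, not the author's procedure: set_conf_param is a hypothetical helper, and the demonstration runs against a scratch file rather than the real ~/repmgr.conf. The sample contents (node_id=238 and the previous value of 60) are taken from the switchover output and are only there for illustration.

```shell
# Hypothetical helper: set KEY=VALUE in a repmgr.conf-style file,
# replacing an existing line for that key or appending a new one.
set_conf_param() {
    conf_file=$1
    key=$2
    value=$3
    if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf_file"; then
        # Rewrite into a temporary file, then move it into place.
        sed -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key}=${value}|" \
            "$conf_file" > "${conf_file}.tmp" &&
            mv "${conf_file}.tmp" "$conf_file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$conf_file"
    fi
}

# Demonstration on a scratch file (assumption: illustrative contents only).
demo_conf=$(mktemp)
printf 'node_id=238\nshutdown_check_timeout=60\n' > "$demo_conf"
set_conf_param "$demo_conf" shutdown_check_timeout 100
cat "$demo_conf"
```

To use it for real, the same call would be pointed at the repmgr.conf on each node, followed by the restart step described above.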
Reference: https://github.com/2ndQuadrant/repmgr/issues/518
Source: https://blog.51cto.com/lishiyan/2633801