repmgr切换以及故障转移恢复
作者:互联网
[pg10@data01 data]$ repmgr -f ~/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------------+---------+-----------+--------------+----------+----------+----------+---------------------------------------------------------------------------
1 | 10.10.10.101 | primary | * running | | default | 100 | 1 | host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666
2 | 10.10.10.102 | standby | running | 10.10.10.101 | default | 100 | 1 | host=10.10.10.102 user=repmgr dbname=postgres password=1qaz@WSX port=5666
3 | 10.10.10.103 | standby | running | 10.10.10.101 | default | 100 | 1 | host=10.10.10.103 user=repmgr dbname=postgres password=1qaz@WSX port=5666
[pg10@data01 data]$
[pg10@data02 conf]$ repmgr -f ~/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------------+---------+-----------+--------------+----------+----------+----------+---------------------------------------------------------------------------
1 | 10.10.10.101 | primary | * running | | default | 100 | 1 | host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666
2 | 10.10.10.102 | standby | running | 10.10.10.101 | default | 100 | 1 | host=10.10.10.102 user=repmgr dbname=postgres password=1qaz@WSX port=5666
3 | 10.10.10.103 | standby | running | 10.10.10.101 | default | 100 | 1 | host=10.10.10.103 user=repmgr dbname=postgres password=1qaz@WSX port=5666
[pg10@data02 conf]$
[pg10@data03 conf]$ repmgr -f ~/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------------+---------+-----------+--------------+----------+----------+----------+---------------------------------------------------------------------------
1 | 10.10.10.101 | primary | * running | | default | 100 | 1 | host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666
2 | 10.10.10.102 | standby | running | 10.10.10.101 | default | 100 | 1 | host=10.10.10.102 user=repmgr dbname=postgres password=1qaz@WSX port=5666
3 | 10.10.10.103 | standby | running | 10.10.10.101 | default | 100 | 1 | host=10.10.10.103 user=repmgr dbname=postgres password=1qaz@WSX port=5666
[pg10@data03 conf]$
在节点2上执行 ,将主节点切换到2节点
[pg10@data02 conf]$ repmgr standby switchover -f ~/conf/repmgr.conf --siblings-follow
NOTICE: executing switchover on node "10.10.10.102" (ID: 2)
NOTICE: local node "10.10.10.102" (ID: 2) will be promoted to primary; current primary "10.10.10.101" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "10.10.10.101" (ID: 1)
NOTICE: issuing CHECKPOINT on node "10.10.10.101" (ID: 1)
DETAIL: executing server command "pg_ctl -D '/home/pg10/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/7000028
NOTICE: promoting standby to primary
DETAIL: promoting server "10.10.10.102" (ID: 2) using "pg_ctl -w -D '/home/pg10/data' promote"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "10.10.10.102" (ID: 2) was successfully promoted to primary
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/7000028; rejoin target node's fork point: 0/7000098
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl -w -D '/home/pg10/data' start"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
NOTICE: node "10.10.10.102" (ID: 2) promoted to primary, node "10.10.10.101" (ID: 1) demoted to standby
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "10.10.10.102" is now primary and node "10.10.10.101" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
[pg10@data02 conf]$
节点1上查询
[pg10@data01 data]$ repmgr -f ~/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------------+---------+-----------+--------------+----------+----------+----------+---------------------------------------------------------------------------
1 | 10.10.10.101 | standby | running | 10.10.10.102 | default | 100 | 1 | host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666
2 | 10.10.10.102 | primary | * running | | default | 100 | 2 | host=10.10.10.102 user=repmgr dbname=postgres password=1qaz@WSX port=5666
3 | 10.10.10.103 | standby | running | 10.10.10.102 | default | 100 | 1 | host=10.10.10.103 user=repmgr dbname=postgres password=1qaz@WSX port=5666
[pg10@data01 data]$
--配置自动切换参数到 ~/conf/repmgr.conf 文件中
节点1
[pg10@data01 ~]$ vim ~/conf/repmgr.conf
node_id=1
node_name='10.10.10.101'
conninfo='host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666'
data_directory='/home/pg10/data'
log_level='info'
log_facility='STDERR'
log_file='/home/pg10/conf/repmgr.log'
failover=automatic
promote_command='/home/pg10/soft/bin/repmgr standby promote -f /home/pg10/conf/repmgr.conf --log-to-file'
follow_command='/home/pg10/soft/bin/repmgr standby follow -f /home/pg10/conf/repmgr.conf --log-to-file --upstream-node-id=%n'
节点2
node_id=2
node_name='10.10.10.102'
conninfo='host=10.10.10.102 user=repmgr dbname=postgres password=1qaz@WSX port=5666'
data_directory='/home/pg10/data'
log_level='info'
log_facility='STDERR'
log_file='/home/pg10/conf/repmgr.log'
failover=automatic
promote_command='/home/pg10/soft/bin/repmgr standby promote -f /home/pg10/conf/repmgr.conf --log-to-file'
follow_command='/home/pg10/soft/bin/repmgr standby follow -f /home/pg10/conf/repmgr.conf --log-to-file --upstream-node-id=%n'
节点3
node_id=3
node_name='10.10.10.103'
conninfo='host=10.10.10.103 user=repmgr dbname=postgres password=1qaz@WSX port=5666'
data_directory='/home/pg10/data'
log_level='info'
log_facility='STDERR'
log_file='/home/pg10/conf/repmgr.log'
failover=automatic
promote_command='/home/pg10/soft/bin/repmgr standby promote -f /home/pg10/conf/repmgr.conf --log-to-file'
follow_command='/home/pg10/soft/bin/repmgr standby follow -f /home/pg10/conf/repmgr.conf --log-to-file --upstream-node-id=%n'
启动 repmgrd 进程
[pg10@data01 ~]$ repmgrd -f ~/conf/repmgr.conf -d
[2021-01-06 09:52:00] [NOTICE] redirecting logging output to "/home/pg10/conf/repmgr.log"
pg10@data02 ~]$ repmgrd -f ~/conf/repmgr.conf -d
[2021-01-06 10:53:20] [NOTICE] redirecting logging output to "/home/pg10/conf/repmgr.log"
[pg10@data03 conf]$ repmgrd -f ~/conf/repmgr.conf -d
[2021-01-06 10:53:26] [NOTICE] redirecting logging output to "/home/pg10/conf/repmgr.log"
任意节点查看集群状态
[pg10@data01 ~]$ repmgr -f ~/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------------+---------+-----------+--------------+----------+----------+----------+---------------------------------------------------------------------------
1 | 10.10.10.101 | standby | running | 10.10.10.102 | default | 100 | 2 | host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666
2 | 10.10.10.102 | primary | * running | | default | 100 | 2 | host=10.10.10.102 user=repmgr dbname=postgres password=1qaz@WSX port=5666
3 | 10.10.10.103 | standby | running | 10.10.10.102 | default | 100 | 2 | host=10.10.10.103 user=repmgr dbname=postgres password=1qaz@WSX port=5666
查看服务状态
[pg10@data01 ~]$ repmgr -f ~/conf/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+--------------+---------+-----------+--------------+---------+-------+---------+--------------------
1 | 10.10.10.101 | standby | running | 10.10.10.102 | running | 74894 | no | 0 second(s) ago
2 | 10.10.10.102 | primary | * running | | running | 74212 | no | n/a
3 | 10.10.10.103 | standby | running | 10.10.10.102 | running | 74615 | no | 0 second(s) ago
模拟故障停掉节点2
[pg10@data02 ~]$
[pg10@data02 ~]$ pg_ctl stop -D $PGDATA -l /tmp/logfile
waiting for server to shut down.... done
server stopped
持续观察下,发现节点1又转换成了主节点,持续时间长点。
[pg10@data01 ~]$ repmgr -f ~/conf/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+--------------+---------+---------------+----------------+---------+-------+---------+--------------------
1 | 10.10.10.101 | standby | running | ? 10.10.10.102 | running | 74894 | no | 7 second(s) ago
2 | 10.10.10.102 | primary | ? unreachable | ? | n/a | n/a | n/a | n/a
3 | 10.10.10.103 | standby | running | ? 10.10.10.102 | running | 74615 | no | 7 second(s) ago
WARNING: following issues were detected
- unable to connect to node "10.10.10.101" (ID: 1)'s upstream node "10.10.10.102" (ID: 2)
- unable to determine if node "10.10.10.101" (ID: 1) is attached to its upstream node "10.10.10.102" (ID: 2)
- unable to connect to node "10.10.10.102" (ID: 2)
- node "10.10.10.102" (ID: 2) is registered as an active primary but is unreachable
- unable to connect to node "10.10.10.103" (ID: 3)'s upstream node "10.10.10.102" (ID: 2)
- unable to determine if node "10.10.10.103" (ID: 3) is attached to its upstream node "10.10.10.102" (ID: 2)
HINT: execute with --verbose option to see connection error messages
[pg10@data01 ~]$ repmgr -f ~/conf/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+--------------+---------+---------------+----------------+---------+-------+---------+--------------------
1 | 10.10.10.101 | standby | running | ? 10.10.10.102 | running | 74894 | no | 10 second(s) ago
2 | 10.10.10.102 | primary | ? unreachable | ? | n/a | n/a | n/a | n/a
3 | 10.10.10.103 | standby | running | ? 10.10.10.102 | running | 74615 | no | 10 second(s) ago
WARNING: following issues were detected
- unable to connect to node "10.10.10.101" (ID: 1)'s upstream node "10.10.10.102" (ID: 2)
- unable to determine if node "10.10.10.101" (ID: 1) is attached to its upstream node "10.10.10.102" (ID: 2)
- unable to connect to node "10.10.10.102" (ID: 2)
- node "10.10.10.102" (ID: 2) is registered as an active primary but is unreachable
- unable to connect to node "10.10.10.103" (ID: 3)'s upstream node "10.10.10.102" (ID: 2)
- unable to determine if node "10.10.10.103" (ID: 3) is attached to its upstream node "10.10.10.102" (ID: 2)
HINT: execute with --verbose option to see connection error messages
[pg10@data01 ~]$ repmgr -f ~/conf/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+--------------+---------+---------------+----------------+---------+-------+---------+--------------------
1 | 10.10.10.101 | standby | running | ? 10.10.10.102 | running | 74894 | no | 13 second(s) ago
2 | 10.10.10.102 | primary | ? unreachable | ? | n/a | n/a | n/a | n/a
3 | 10.10.10.103 | standby | running | ? 10.10.10.102 | running | 74615 | no | 13 second(s) ago
WARNING: following issues were detected
- unable to connect to node "10.10.10.101" (ID: 1)'s upstream node "10.10.10.102" (ID: 2)
- unable to determine if node "10.10.10.101" (ID: 1) is attached to its upstream node "10.10.10.102" (ID: 2)
- unable to connect to node "10.10.10.102" (ID: 2)
- node "10.10.10.102" (ID: 2) is registered as an active primary but is unreachable
- unable to connect to node "10.10.10.103" (ID: 3)'s upstream node "10.10.10.102" (ID: 2)
- unable to determine if node "10.10.10.103" (ID: 3) is attached to its upstream node "10.10.10.102" (ID: 2)
HINT: execute with --verbose option to see connection error messages
[pg10@data01 ~]$ repmgr -f ~/conf/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+--------------+---------+---------------+----------------+---------+-------+---------+--------------------
1 | 10.10.10.101 | standby | running | ? 10.10.10.102 | running | 74894 | no | 40 second(s) ago
2 | 10.10.10.102 | primary | ? unreachable | ? | n/a | n/a | n/a | n/a
3 | 10.10.10.103 | standby | running | ? 10.10.10.102 | running | 74615 | no | 40 second(s) ago
WARNING: following issues were detected
- unable to connect to node "10.10.10.101" (ID: 1)'s upstream node "10.10.10.102" (ID: 2)
- unable to determine if node "10.10.10.101" (ID: 1) is attached to its upstream node "10.10.10.102" (ID: 2)
- unable to connect to node "10.10.10.102" (ID: 2)
- node "10.10.10.102" (ID: 2) is registered as an active primary but is unreachable
- unable to connect to node "10.10.10.103" (ID: 3)'s upstream node "10.10.10.102" (ID: 2)
- unable to determine if node "10.10.10.103" (ID: 3) is attached to its upstream node "10.10.10.102" (ID: 2)
HINT: execute with --verbose option to see connection error messages
[pg10@data01 ~]$ repmgr -f ~/conf/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+--------------+---------+-----------+--------------+---------+-------+---------+--------------------
1 | 10.10.10.101 | primary | * running | | running | 74894 | no | n/a
2 | 10.10.10.102 | primary | - failed | ? | n/a | n/a | n/a | n/a
3 | 10.10.10.103 | standby | running | 10.10.10.101 | running | 74615 | no | 1 second(s) ago
WARNING: following issues were detected
- unable to connect to node "10.10.10.102" (ID: 2)
HINT: execute with --verbose option to see connection error messages
[pg10@data01 ~]$
[pg10@data01 ~]$ repmgr -f ~/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------------+---------+-----------+--------------+----------+----------+----------+---------------------------------------------------------------------------
1 | 10.10.10.101 | primary | * running | | default | 100 | 3 | host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666
2 | 10.10.10.102 | primary | - failed | ? | default | 100 | | host=10.10.10.102 user=repmgr dbname=postgres password=1qaz@WSX port=5666
3 | 10.10.10.103 | standby | running | 10.10.10.101 | default | 100 | 2 | host=10.10.10.103 user=repmgr dbname=postgres password=1qaz@WSX port=5666
WARNING: following issues were detected
- unable to connect to node "10.10.10.102" (ID: 2)
HINT: execute with --verbose option to see connection error messages
[pg10@data01 ~]$
故障节点需要停止,删除 $PGDATA/* /data/appdb_tbs/* /data/pglog
重新执行克隆
打开节点数据库
[pg10@data02 ~]$ cd $PGDATA
[pg10@data02 data]$ ll
total 72
-rw-------. 1 pg10 pg10 199 Jan 6 09:41 backup_label.old
drwx------. 5 pg10 pg10 41 Jan 6 09:41 base
-rw-------. 1 pg10 pg10 37 Jan 6 09:42 current_logfiles
drwx------. 2 pg10 pg10 4096 Jan 6 09:46 global
drwx------. 2 pg10 pg10 6 Jan 6 09:41 pg_commit_ts
drwx------. 2 pg10 pg10 6 Jan 6 09:41 pg_dynshmem
-rw-------. 1 pg10 pg10 4883 Jan 6 09:41 pg_hba.conf
-rw-------. 1 pg10 pg10 1636 Jan 6 09:41 pg_ident.conf
drwx------. 4 pg10 pg10 68 Jan 6 10:56 pg_logical
drwx------. 4 pg10 pg10 36 Jan 6 09:41 pg_multixact
drwx------. 2 pg10 pg10 18 Jan 6 09:42 pg_notify
drwx------. 2 pg10 pg10 6 Jan 6 09:41 pg_replslot
drwx------. 2 pg10 pg10 6 Jan 6 09:41 pg_serial
drwx------. 2 pg10 pg10 6 Jan 6 09:41 pg_snapshots
drwx------. 2 pg10 pg10 63 Jan 6 10:56 pg_stat
drwx------. 2 pg10 pg10 6 Jan 6 10:56 pg_stat_tmp
drwx------. 2 pg10 pg10 18 Jan 6 09:47 pg_subtrans
drwx------. 2 pg10 pg10 19 Jan 6 09:41 pg_tblspc
drwx------. 2 pg10 pg10 6 Jan 6 09:41 pg_twophase
-rw-------. 1 pg10 pg10 3 Jan 6 09:41 PG_VERSION
drwx------. 3 pg10 pg10 220 Jan 6 10:52 pg_wal
drwx------. 2 pg10 pg10 18 Jan 6 09:41 pg_xact
-rw-------. 1 pg10 pg10 526 Jan 6 09:41 postgresql.auto.conf
-rw-------. 1 pg10 pg10 23039 Jan 6 09:41 postgresql.conf
-rw-------. 1 pg10 pg10 52 Jan 6 09:42 postmaster.opts
-rw-------. 1 pg10 pg10 169 Jan 6 09:41 recovery.done
-rw-rw-r--. 1 pg10 pg10 969 Jan 6 09:41 server.crt
-rw-------. 1 pg10 pg10 1704 Jan 6 09:41 server.key
[pg10@data02 data]$ rm -rf *
[pg10@data02 data]$ cd /data/
[pg10@data02 data]$ ls
appdb_tbs pglog
[pg10@data02 data]$ cd appdb_tbs/
[pg10@data02 appdb_tbs]$ ll
total 0
drwx------. 3 pg10 pg10 19 Jan 6 09:41 PG_10_201707211
[pg10@data02 appdb_tbs]$ rm -rf PG_10_201707211/
[pg10@data02 appdb_tbs]$ cd ..
[pg10@data02 data]$ ll
total 0
drwxrwxr-x. 2 pg10 pg10 6 Jan 6 11:01 appdb_tbs
drwxrwxr-x. 2 pg10 pg10 52 Jan 6 09:42 pglog
[pg10@data02 data]$ cd pglog/
[pg10@data02 pglog]$ ll
total 12
-rw-------. 1 pg10 pg10 166 Jan 6 09:42 postgresql.06
-rw-------. 1 pg10 pg10 6199 Jan 6 10:56 postgresql.06.csv
[pg10@data02 pglog]$ rm *
[pg10@data02 pglog]$
[pg10@data02 ~]$ repmgr -h 10.10.10.101 -U repmgr -d postgres -f ~/conf/repmgr.conf standby clone --dry-run -p5666
NOTICE: destination directory "/home/pg10/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=10.10.10.101 user=repmgr port=5666 dbname=postgres
DETAIL: current installation size is 30 MB
INFO: "repmgr" extension is installed in database "postgres"
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 9 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met
[pg10@data02 ~]$ repmgr -h 10.10.10.101 -U repmgr -d postgres -f ~/conf/repmgr.conf standby clone -p5666
NOTICE: destination directory "/home/pg10/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=10.10.10.101 user=repmgr port=5666 dbname=postgres
DETAIL: current installation size is 30 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/home/pg10/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
pg_basebackup -l "repmgr base backup" -D /home/pg10/data -h 10.10.10.101 -p 5666 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /home/pg10/data start
HINT: after starting the server, you need to re-register this standby with "repmgr standby register --force" to update the existing node record
[pg10@data02 ~]$ pg_ctl start -D $PGDATA -l /tmp/logfile
waiting for server to start.... done
server started
[pg10@data02 ~]$ repmgrd -f ~/conf/repmgr.conf -d
[2021-01-06 11:03:44] [NOTICE] redirecting logging output to "/home/pg10/conf/repmgr.log"
[pg10@data02 ~]$ repmgr -f ~/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------------+---------+-----------+--------------+----------+----------+----------+---------------------------------------------------------------------------
1 | 10.10.10.101 | primary | * running | | default | 100 | 3 | host=10.10.10.101 user=repmgr dbname=postgres password=1qaz@WSX port=5666
2 | 10.10.10.102 | standby | running | | default | 100 | 3 | host=10.10.10.102 user=repmgr dbname=postgres password=1qaz@WSX port=5666
3 | 10.10.10.103 | standby | running | 10.10.10.101 | default | 100 | 3 | host=10.10.10.103 user=repmgr dbname=postgres password=1qaz@WSX port=5666
[pg10@data02 ~]$
标签:node,10.102,pg10,故障,切换,conf,10.10,repmgr 来源: https://blog.51cto.com/lishiyan/2583102