Patroni + Etcd 实现高可用之二: 高可用测试

上篇博客介绍了Patroni + Etcd部署PostgreSQL高可用,这篇博客进行高可用测试。

规划以下高可用测试场景:

  • 场景一 停备库PostgreSQL实例: 停任意一个备库实例,观察主备情况。
  • 场景二 停主库PostgreSQL实例: 停主库实例,观察主备情况。
  • 场景三 停主库主机: 停主库主机,观察主备情况。
  • 场景四 停两个备库主机: 停两个备库主机,观察主备情况。
  • 场景四 停一个etcd备进程: 停任意一个etcd备进程,观察etcd、patroni、主备情况。
  • 场景五 停etcd主进程: 停etcd主进程,观察etcd、patroni、主备情况。
  • 场景六 停两个etcd备进程: 停两个etcd备进程,观察etcd、patroni、主备情况。
  • 场景七 停一个patroni备进程: 停一个patroni备进程,观察etcd、patroni、主备情况。
  • 场景八 停patroni主进程 : 停patroni主进程,观察etcd、patroni、主备情况。
  • 场景九 停两个patroni备进程: 停两个patroni备进程,观察etcd、patroni、主备情况。

本文对以上九个高可用场景进行演练,记录前四个场景的演练过程,后五个场景仅记录演练结果。

场景一 停备库PostgreSQL实例

集群初始状态如下:

1
2
3
4
5
6
7
8
[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | | running | 18 | 0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 18 | |
| pg_ydtf03 | 192.168.2.13:1921 | | running | 18 | 0 |
+-----------+-------------------+--------+---------+----+-----------+

在主机pg_ydtf01上停从库,如下:

1
2
3
[pg12@ydtf01 ~]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped

查看pg_ydtf01主机上的patroni日志,如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
2020-05-26 08:44:32,392 INFO: no action.  i am a secondary and i am following a leader
2020-05-26 08:44:42,374 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:42,374 INFO: does not have lock
2020-05-26 08:44:42,388 INFO: no action. i am a secondary and i am following a leader


2020-05-26 08:44:52,381 WARNING: Postgresql is not running.
2020-05-26 08:44:52,382 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,407 INFO: pg_controldata:
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6823331807018309697
Database cluster state: shut down in recovery
pg_control last modified: Tue May 26 08:44:45 2020
Latest checkpoint location: 0/15000098
Latest checkpoint's REDO location: 0/15000060
Latest checkpoint's REDO WAL file: 000000120000000000000015
Latest checkpoint's TimeLineID: 18
Latest checkpoint's PrevTimeLineID: 18
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:505
Latest checkpoint's NextOID: 24576
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 480
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 505
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Tue May 26 08:33:37 2020
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/16000450
Min recovery ending locs timeline: 18
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 16384
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: e0b437070cc63fb2e55b904a85a73d4aa92d629ab56540beb214451f52967090

2020-05-26 08:44:52,411 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,411 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,417 INFO: starting as a secondary
2020-05-26 08:44:52,428 INFO: closed patroni connection to the postgresql cluster
2020-05-26 08:44:52.829 CST [39065] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-26 08:44:52.830 CST [39065] LOG: listening on IPv4 address "0.0.0.0", port 1921
2020-05-26 08:44:52.832 CST [39065] LOG: listening on Unix socket "/tmp/.s.PGSQL.1921"
2020-05-26 08:44:52,836 INFO: postmaster pid=39065
2020-05-26 08:44:52.849 CST [39065] LOG: redirecting log output to logging collector process
2020-05-26 08:44:52.849 CST [39065] HINT: Future log output will appear in directory "log".
localhost:1921 - rejecting connections
localhost:1921 - accepting connections
2020-05-26 08:44:52,891 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,891 INFO: does not have lock
2020-05-26 08:44:52,891 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 08:44:52,938 INFO: no action. i am a secondary and i am following a leader

2020-05-26 08:45:02,388 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:45:02,388 INFO: does not have lock
2020-05-26 08:45:02,397 INFO: no action. i am a secondary and i am following a leader

通过日志看出,patroni 监控到本机上的备库宕机后,自动将它拉起。

查看集群状态,没有变化,如下:

1
2
3
4
5
6
7
8
[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | | running | 18 | 0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 18 | |
| pg_ydtf03 | 192.168.2.13:1921 | | running | 18 | 0 |
+-----------+-------------------+--------+---------+----+-----------+

场景二 停主库PostgreSQL实例

pg_ydtf02主机上关闭主库,如下:

1
2
3
[pg12@ydtf02 patroni]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped

监控主机pg_ydtf02上的patroni日志,如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
2020-05-26 08:50:14,195 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:14,218 INFO: no action. i am the leader with the lock

2020-05-26 08:50:24,198 WARNING: Postgresql is not running.
2020-05-26 08:50:24,198 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,222 INFO: pg_controldata:
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6823331807018309697
Database cluster state: shut down
pg_control last modified: Tue May 26 08:50:15 2020
Latest checkpoint location: 0/16000450
Latest checkpoint's REDO location: 0/16000450
Latest checkpoint's REDO WAL file: 000000120000000000000016
Latest checkpoint's TimeLineID: 18
Latest checkpoint's PrevTimeLineID: 18
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:506
Latest checkpoint's NextOID: 24576
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 480
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Tue May 26 08:50:15 2020
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending locs timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 16384
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: e0b437070cc63fb2e55b904a85a73d4aa92d629ab56540beb214451f52967090

2020-05-26 08:50:24,226 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,227 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,233 INFO: starting as readonly because i had the session lock
2020-05-26 08:50:24,234 INFO: closed patroni connection to the postgresql cluster
2020-05-26 08:50:24.490 CST [39033] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-26 08:50:24.491 CST [39033] LOG: listening on IPv4 address "0.0.0.0", port 1921
2020-05-26 08:50:24.493 CST [39033] LOG: listening on Unix socket "/tmp/.s.PGSQL.1921"
2020-05-26 08:50:24,512 INFO: postmaster pid=39033
2020-05-26 08:50:24.517 CST [39033] LOG: redirecting log output to logging collector process
2020-05-26 08:50:24.517 CST [39033] HINT: Future log output will appear in directory "log".
localhost:1921 - rejecting connections
localhost:1921 - accepting connections
2020-05-26 08:50:24,540 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,540 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 08:50:24,579 INFO: promoted self to leader because i had the session lock
server promoting
2020-05-26 08:50:24,585 INFO: cleared rewind state after becoming the leader
2020-05-26 08:50:25,605 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:25,651 INFO: no action. i am the leader with the lock
2020-05-26 08:50:35,605 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:35,632 INFO: no action. i am the leader with the lock

根据日志看出,pg_ydtf02上的patroni发现本机的PostgreSQL宕机后,自动将它拉起。

查看集群状态,没有发生变化,如下:

1
2
3
4
5
6
7
8
[pg12@ydtf02 patroni]$ patronictl -c /home/pg12/patroni/pg_ydtf02.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | | running | 19 | 0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 19 | |
| pg_ydtf03 | 192.168.2.13:1921 | | running | 19 | 0 |
+-----------+-------------------+--------+---------+----+-----------+

场景三 停主库主机

重启 pg_ydtf02 主机,如下:

1
[root@ydtf02 ~]# reboot

查看 pg_ydtf01和pg_ydtf03的patroni的日志,发现 pg_ydtf01获得了leader锁,并切换为主库,日志如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
2020-05-26 08:56:32,055 INFO: Got response from pg_ydtf03 http://192.168.2.13:8008/patroni: {"state": "running", "postmaster_start_time": "2020-05-26 08:37:43.074 CST", "role": "replica", "server_version": 120002, "cluster_unlocked": true, "xlog": {"received_location": 369100432, "replayed_location": 369100432, "replayed_timestamp": "2020-05-26 08:34:34.945 CST", "paused": false}, "timeline": 19, "database_system_identifier": "6823331807018309697", "patroni": {"version": "1.6.5", "scope": "pg_ydtf"}}
2020-05-26 08:56:34,010 WARNING: Request failed to pg_ydtf02: GET http://192.168.2.12:8008/patroni (HTTPConnectionPool(host='192.168.2.12', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f872c3da280>, 'Connection to 192.168.2.12 timed out. (connect timeout=2)')))
2020-05-26 08:56:34,039 INFO: promoted self to leader by acquiring session lock
server promoting
2020-05-26 08:56:34,049 INFO: cleared rewind state after becoming the leader
2020-05-26 08:56:35,070 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:35,120 INFO: no action. i am the leader with the lock
2020-05-26 08:56:45,075 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:45,112 INFO: no action. i am the leader with the lock
2020-05-26 08:56:55,071 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:55,093 INFO: no action. i am the leader with the lock
2020-05-26 08:57:05,071 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:57:05,095 INFO: no action. i am the leader with the lock

查看集群状态,发现进行了主备切换,如下:

1
2
3
4
5
6
7
[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | Leader | running | 20 | |
| pg_ydtf03 | 192.168.2.13:1921 | | running | 20 | 0 |
+-----------+-------------------+--------+---------+----+-----------+

pg_ydtf01升级为主库,验证 pg_ydtf01的主备状态,如下:

1
2
[pg12@ydtf01 ~]$ pg_controldata | grep cluster
Database cluster state: in production

之后尝试启动pg_ydtf02上的备库,报时间线的错误,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
2020-05-29 16:53:52.780 CST,,,2574,,5ed0cd5f.a0e,27,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:53:52.785 CST,,,2574,,5ed0cd5f.a0e,28,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:53:52.786 CST,,,2505,,5ed0cd5a.9c9,20,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:53:57.295 CST,,,2600,"127.0.0.1:45300",5ed0cda5.a28,1,"",2020-05-29 16:53:57 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=45300",,,,,,,,,""
2020-05-29 16:53:57.297 CST,"repuser","",2600,"127.0.0.1:45300",5ed0cda5.a28,2,"authentication",2020-05-29 16:53:57 CST,3/9,0,LOG,00000,"replication connection authorized: user=repuser",,,,,,,,,""
2020-05-29 16:53:57.785 CST,,,2574,,5ed0cd5f.a0e,29,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:53:57.789 CST,,,2574,,5ed0cd5f.a0e,30,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:53:57.790 CST,,,2505,,5ed0cd5a.9c9,21,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:54:02.792 CST,,,2574,,5ed0cd5f.a0e,31,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:54:02.796 CST,,,2574,,5ed0cd5f.a0e,32,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:54:02.796 CST,,,2505,,5ed0cd5a.9c9,22,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:54:07.805 CST,,,2574,,5ed0cd5f.a0e,34,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""

解决方法: 关闭pg_ydtf02上的数据库,之后执行 pg_rewind增量同步主库上的数据,如下:

1
2
3
4
5
6
7
8
9
10
11
12
[pg12@ydtf02 log]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.2.11 port=1921 user=postgres dbname=postgres' -P
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/1C0259E8 on timeline 29
pg_rewind: rewinding from last common checkpoint at 0/1C0252E8 on timeline 29
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 325 MB (total source directory size is 345 MB)
333628/333628 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

场景四 停两个备库主机

集群状态如下:

1
2
3
4
5
6
7
8
[pg12@ydtf01 patroni]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | | running | 25 | 0 |
| pg_ydtf02 | 192.168.2.12:1921 | | running | 25 | 0 |
| pg_ydtf03 | 192.168.2.13:1921 | Leader | running | 25 | |
+-----------+-------------------+--------+---------+----+-----------+

在 pg_ydtf01 和 pg_ydtf02上执行 reboot模拟关闭主机,发现pg_ydtf03上的主库PostgreSQL实例也宕掉了,数据库日志如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
[pg12@ydtf03 log]$ view postgresql-2020-05-29_153456.csv
2020-05-29 15:34:56.585 CST,,,35725,,5ed0bb20.8b8d,1,,2020-05-29 15:34:56 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2020-05-29 15:34:56.592 CST,,,35729,,5ed0bb20.8b91,1,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system was interrupted; last known up at 2020-05-29 15:09:55 CST",,,,,,,,,""
2020-05-29 15:34:56.593 CST,,,35731,"[local]",5ed0bb20.8b93,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-05-29 15:34:56.593 CST,"postgres","postgres",35731,"[local]",5ed0bb20.8b93,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.593 CST,,,35730,"127.0.0.1:49080",5ed0bb20.8b92,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=49080",,,,,,,,,""
2020-05-29 15:34:56.593 CST,"rewind_user","postgres",35730,"127.0.0.1:49080",5ed0bb20.8b92,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.603 CST,,,35733,"127.0.0.1:49084",5ed0bb20.8b95,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=49084",,,,,,,,,""
2020-05-29 15:34:56.603 CST,"rewind_user","postgres",35733,"127.0.0.1:49084",5ed0bb20.8b95,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.634 CST,,,35729,,5ed0bb20.8b91,2,,2020-05-29 15:34:56 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2020-05-29 15:34:56.637 CST,,,35729,,5ed0bb20.8b91,3,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2020-05-29 15:34:56.641 CST,,,35729,,5ed0bb20.8b91,4,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"redo starts at 0/1C021F68",,,,,,,,,""
2020-05-29 15:34:56.641 CST,,,35729,,5ed0bb20.8b91,5,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"invalid record length at 0/1C022050: wanted 24, got 0",,,,,,,,,""
2020-05-29 15:34:56.643 CST,,,35729,,5ed0bb20.8b91,6,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/1C022050",,,,,,,,,""
2020-05-29 15:34:56.644 CST,,,35725,,5ed0bb20.8b8d,2,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""

从日志看出,主库异常宕机后,以备库角色启动,验证数据库角色,如下:

1
2
[pg12@ydtf03 log]$ pg_controldata | grep cluster
Database cluster state: in archive recovery

同时,pg_ydtf03上的patroni日志出现大量以下信息:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
2020-05-29 15:34:51,784 ERROR: Request to server http://192.168.2.12:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.2.12', port=2379): Max retries exceeded with url: /v2/keys/service/pg_ydtf/leader (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f3de40eb790>, 'Connection to 192.168.2.12 timed out. (connect timeout=1)'))")
2020-05-29 15:34:51,785 INFO: Reconnection allowed, looking for another server.
2020-05-29 15:34:51,787 ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 422, in wrapper
retval = func(self, *args, **kwargs) is not None
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 621, in _update_leader
return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 406, in retry
return retry(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/patroni/utils.py", line 330, in __call__
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/etcd/client.py", line 500, in write
response = self.api_execute(path, method, params=params)
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 264, in api_execute
raise ex
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 248, in api_execute
response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 219, in _do_http_request
raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster
2020-05-29 15:34:51,806 ERROR: failed to update leader lock
2020-05-29 15:34:55,268 INFO: Selected new etcd server http://192.168.2.12:2379

可见,停掉两个备库后,主库也会宕,同时主库角色切换为备库,由于备库仅提供只读服务,对生产系统有较大影响,引发生产故障。

总结

汇总九个高可用测试场景演练结果,如下:

场景 测试结果
场景一 停备库PostgreSQL实例 备库自动拉起
场景二 停主库PostgreSQL实例 主库自动拉起
场景三 停主库主机 触发主备切换,切换时间在30秒内。
场景四 停两个备库主机 主库异常宕机,之后主库以备库角色启动。
场景四 停一个etcd备进程 不影响主库、备库,patroni会连接其它节点上的etcd。
场景五 停etcd主进程 不影响主库、备库,patroni会连接其它节点上的etcd。
场景六 停两个etcd备进程 主库异常宕机,之后主库以备库角色启动。
场景七 停一个patroni备进程 一个备库宕机,不影响主库。
场景八 停patroni主进程 触发主备切换,切换时间在30秒内。
场景九 停两个patroni备进程 主库不影响

参考

最后推荐和张文升共同编写的《PostgreSQL实战》,本书基于PostgreSQL 10 编写,共18章,重点介绍SQL高级特性、并行查询、分区表、物理复制、逻辑复制、备份恢复、高可用、性能优化、PostGIS等,涵盖大量实战用例!

购买链接:https://item.jd.com/12405774.html

PostgreSQL实战
感谢支持!
0%