PostgreSQL流复制:备库主机宕机一例

今天一数据库流复制环境备库由于硬件原因导致主机宕机,PostgreSQL 版本为 9.2.1,持续时间为 5 小时左右,可能很多人会认为,备库挂了,可能要重做了,庆幸的是, pg 在 这方面非常省心,在这种情况下,理论上只要主库的保持的 WAL 足够多,备库就能赶上主库并恢复,简单记录下。

备库数据库日志

备库系统启动后,数据库日志如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
2012-10-29 20:07:50.193 CST,,,11520,,508e7196.2d00,1,,2012-10-29 20:07:50 CST,,0,LOG,00000,"database system was interrupted while in recovery at log time 2012-10-29 15:12:23 CST",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,""  
2012-10-29 20:07:50.226 CST,,,11520,,508e7196.2d00,2,,2012-10-29 20:07:50 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2012-10-29 20:07:50.280 CST,,,11520,,508e7196.2d00,3,,2012-10-29 20:07:50 CST,1/0,0,LOG,00000,"redo starts at DC/DAAC3460",,,,,,,,,""
2012-10-29 20:08:42.461 CST,,,11521,,508e7196.2d01,1,,2012-10-29 20:07:50 CST,,0,LOG,00000,"restartpoint starting: xlog",,,,,,,,,""
2012-10-29 20:08:57.384 CST,,,11520,,508e7196.2d00,4,,2012-10-29 20:07:50 CST,1/0,0,LOG,00000,"consistent recovery state reached at DD/CED97FE0",,,,,,,,,""
2012-10-29 20:08:57.385 CST,,,11518,,508e7195.2cfe,1,,2012-10-29 20:07:49 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""
2012-10-29 20:08:57.403 CST,,,11520,,508e7196.2d00,5,,2012-10-29 20:07:50 CST,1/0,0,LOG,00000,"unexpected pageaddr D8/FAEC8000 in log file 221, segment 206, offset 15499264",,,,,,,,,""
2012-10-29 20:08:57.442 CST,,,11607,,508e71d9.2d57,1,,2012-10-29 20:08:57 CST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,,""
2012-10-29 20:09:41.259 CST,,,11642,"",508e7205.2d7a,1,"",2012-10-29 20:09:41 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=13025",,,,,,,,,""
2012-10-29 20:09:51.146 CST,,,11643,"",508e720f.2d7b,1,"",2012-10-29 20:09:51 CST,,0,LOG,00000,"connection received: host=192.168.1.38 port=53984",,,,,,,,,""
2012-10-29 20:09:51.146 CST,,,11643,"xxx.xxx.xxx.xxx:xxxxx",508e720f.2d7b,2,"",2012-10-29 20:09:51 CST,,0,LOG,08P01,"incomplete startup packet",,,,,,,,,""
2012-10-29 20:09:54.177 CST,,,11521,,508e7196.2d01,2,,2012-10-29 20:07:50 CST,,0,LOG,00000,"restartpoint complete: wrote 81492 buffers (31.1%); 0 transaction log file(s) added, 120 removed, 257 recycled; write=56.140 s, sync=14.822 s, total=71.715 s; sync files=66, longest=5.080 s, average=0.224 s",,,,,,,,,""
2012-10-29 20:09:54.177 CST,,,11521,,508e7196.2d01,3,,2012-10-29 20:07:50 CST,,0,LOG,00000,"recovery restart point at DD/5B013708","last completed transaction was at log time 2012-10-29 15:26:04.08085+08",,,,,,,,""

备注:上面是备库系统宕机后,重新启动 PostgreSQL 时的日志,在备库恢复过程中还看不到 wal 接收进程,过会之后能看到如下进程。

检查备库 WAL 接收进程

1
2
3
[postgres@db](mailto:postgres@db)> ps -ef | grep wal  
postgres 11607 11518 1 20:08 ? 00:00:43 postgres: wal receiver process streaming E1/1D11F308
postgres 16693 11871 0 20:59 pts/2 00:00:00 grep wal

主库 Wal_keep_segments 参数

1
2
3
4
5
postgres=> show wal_keep_segments ;  
wal_keep_segments
-------------------
1024
(1 row)

备注:主库设置的 wal_keep_segments 参数为 1024,一个较大的 wal_keep_segments 设置,允许备库在宕机较长的时间内依然能够重新追上主库,当然这与主库的繁忙程度有关,主库越忙,产生的 WAL 日志越多,之前的 WAL 日志越容易被覆盖。

总结

此文仅简单记录了 PostgreSQL 流复制环境,standby 由于系统挂掉后恢复的情况,从以上看出,备库挂了之后,重新恢复是非常容易的,前提是在硬盘空间允许的范围内,主库尽量设置较大的 wal_keep_segments 参数,从而保证较多的 WAL 日志文件。

最后推荐和张文升共同编写的《PostgreSQL实战》,本书基于PostgreSQL 10 编写,共18章,重点介绍SQL高级特性、并行查询、分区表、物理复制、逻辑复制、备份恢复、高可用、性能优化、PostGIS等,涵盖大量实战用例!

购买链接:https://item.jd.com/12405774.html

PostgreSQL实战
感谢支持!
0%