PostgreSQL: 流复制备库 PITR 恢复一例

今天发现一套流复制环境备库异常, 已经不能和主库同步了, 这个库是日志库, 经常出现主备数据同步延迟的告警, 这类告警经常被俺疏忽, 环境信息如下:

流复制环境

PG 版本: 9.2.8
数据库大小 : > 3 TB
业务类型: 日志库
高可用: 流复制
WAL SIZE: 16 MB
每天归档量: 30000 个左右 WAL

备库数据库日志

1
2
3
2014-07-16  10:14:33.515 CST,,,1202,,53c5e009.4b2,2,,2014-07-16  10:14:33 CST,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 00000006000060EB000000A2 has already been removed",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:389",""  
2014-07-16 10:14:38.520 CST,,,1210,,53c5e00e.4ba,1,,2014-07-16 10:14:38 CST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,"libpqrcv_connect, libpqwalreceiver.c:171",""
2014-07-16 10:14:38.520 CST,,,1210,,53c5e00e.4ba,2,,2014-07-16 10:14:38 CST,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 00000006000060EB000000A2 has already been removed"

备注:大量上述日志, 关于这个问题, 之前写了 blog ,如下:
PostgreSQL:“ FATAL: requested WAL segment 0000000800002A0000000000 has already been removed”

之前的处理方法是重做备库, 这次不需要, 因为前段时间已开启了此库的归档。

查看主库的归档

1
2
3
4
5
6
7
8
9
postgres@db--> psql  
psql (9.2.8)
Type "help" for help.

postgres=# show archive_command ;
archive_command
------------------------------------------------------------------------------------------------
DIR=/pg_arch/arch/`date +%F`; test ! -d $DIR && mkdir $DIR; test ! -f $DIR/%f && cp %p $DIR/%f
(1 row)

备注: 每天一个归档目录, 既然有了归档, 计划将归档目录 nfs 共享给备库做恢复

归档目录

1
2
3
postgres@db--> ls /pg_arch/arch/  
2014-06-30 2014-07-02 2014-07-04 2014-07-06 2014-07-08 2014-07-10 2014-07-12 2014-07-14 2014-07-16
2014-07-01 2014-07-03 2014-07-05 2014-07-07 2014-07-09 2014-07-11 2014-07-13 2014-07-15

配置 NFS

修改主库 /etc/exports

1
/pg_arch 192.168.xxx.xxx/32(rw,no_root_squash,sync)

重启 nfs 服务

1
2
3
4
5
6
7
8
9
10
11
[root@db- arch]# service nfs restart  
Shutting down NFS daemon: [OK ]
Shutting down NFS mountd: [OK ]
Shutting down NFS quotas: [OK ]
Shutting down NFS services: [OK ]
Shutting down RPC idmapd: [OK ]
Starting NFS services: [OK ]
Starting NFS quotas: [OK ]
Starting NFS mountd: [OK ]
Starting NFS daemon: [OK ]
[[AStarting RPC idmapd: [OK ]

备库 mount 归档目录

1
2
[root@db2 ~]# mkdir /pg_restore  
[root@db2 ~]# mount -t nfs 192.168.xxx.xxx:/pg_arch /pg_restore

备注: 我们将归档目录 mount 到 /pg_restore 目录, nfs 其它配置这里不记录了.请查看 nfs 相关文档,接下来可以进行还原操作了.

修改备库配置

修改备库 recovery.conf 文件的以下参数

1
restore_command =  'cp /pg_restore/arch/2014-07-[1-3][0-9]/%f %p;'

备注: 因为要取多天归档, 所以用了正则.

修改主库配置

操作前, 调整主库参数

1
wal_keep_segments =  20000

备注: 之前设置得太小了,调整后执行 reload 操作。

重启备库

1
pg_ctl restart -m fast -D $PGDATA

备注: 接下来进行备库恢复操作, 重启备库。

查看备库日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
2014-07-16  10:14:54.583 CST,,,1305,,53c5e01e.519,1,,2014-07-16  10:14:54 CST,,0,LOG,00000,"database system was shut down in recovery at 2014-07-16 10:14:42 CST",,,,,,,,"StartupXLOG, xlog.c:6273",""  
2014-07-16 10:15:09.231 CST,,,1619,"",53c5e02d.653,1,"",2014-07-16 10:15:09 CST,,0,LOG,00000,"connection received: host=192.168.100.3 port=50518",,,,,,,,"BackendInitialize, postmaster.c:3471",""
2014-07-16 10:15:09.234 CST,,,1619,"192.168.100.3:50518",53c5e02d.653,2,"",2014-07-16 10:15:09 CST,,0,LOG,08P01,"incomplete startup packet",,,,,,,,"ProcessStartupPacket, postmaster.c:1505",""
2014-07-16 10:15:50.825 CST,,,2241,"",53c5e056.8c1,1,"",2014-07-16 10:15:50 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=11732",,,,,,,,"BackendInitialize, postmaster.c:3471",""
2014-07-16 10:15:50.825 CST,"postgres","postgres",2241,"127.0.0.1:11732",53c5e056.8c1,2,"",2014-07-16 10:15:50 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,"ProcessStartupPacket, postmaster.c:1759",""
2014-07-16 10:15:52.906 CST,,,2258,"",53c5e058.8d2,1,"",2014-07-16 10:15:52 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=11734",,,,,,,,"BackendInitialize, postmaster.c:3471",""
....
...
2014-07-16 10:23:15.393 CST,,,1305,,53c5e01e.519,11,,2014-07-16 10:14:54 CST,1/0,0,LOG,00000,"restored log file ""00000006000060EB0000000C"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3273",""
2014-07-16 10:23:18.318 CST,,,1305,,53c5e01e.519,12,,2014-07-16 10:14:54 CST,1/0,0,LOG,00000,"restored log file ""00000006000060EB0000000D"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3273",""
2014-07-16 10:24:17.714 CST,,,1305,,53c5e01e.519,13,,2014-07-16 10:14:54 CST,1/0,0,LOG,00000,"restored log file ""00000006000060EB0000000E"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3273",""
2014-07-16 10:24:20.552 CST,,,1305,,53c5e01e.519,14,,2014-07-16 10:14:54 CST,1/0,0,LOG,00000,"restored log file ""00000006000060EB0000000F"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3273",""
....

省略部分日志

备注: 可以看到, 备库在拼命地读归档日志并 apply, 库比较大, WAL 也很多, 共有两天的 WAL 日志需要应用 , 估计没个一天半载跑不完。( 后来花了一天半才恢复!)

后续措施

  1. 调高 wal_keep_segments 参数
  2. 加大监控力度

参考

最后推荐和张文升共同编写的《PostgreSQL实战》,本书基于PostgreSQL 10 编写,共18章,重点介绍SQL高级特性、并行查询、分区表、物理复制、逻辑复制、备份恢复、高可用、性能优化、PostGIS等,涵盖大量实战用例!

购买链接:https://item.jd.com/12405774.html

PostgreSQL实战
感谢支持!
0%