今天发现一套流复制环境备库异常, 已经不能和主库同步了, 这个库是日志库, 经常出现主备数据同步延迟的告警, 这类告警经常被俺疏忽, 环境信息如下:
流复制环境
PG 版本: 9.2.8
数据库大小 : > 3 TB
业务类型: 日志库
高可用: 流复制
WAL SIZE: 16 MB
每天归档量: 30000 个左右 WAL
备库数据库日志
1 | 2014-07-16 10:14:33.515 CST,,,1202,,53c5e009.4b2,2,,2014-07-16 10:14:33 CST,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 00000006000060EB000000A2 has already been removed",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:389","" |
备注:大量上述日志, 关于这个问题, 之前写了 blog ,如下:
PostgreSQL:“ FATAL: requested WAL segment 0000000800002A0000000000 has already been removed”
之前的处理方法是重做备库, 这次不需要, 因为前段时间已开启了此库的归档。
查看主库的归档
1 | postgres@db--> psql |
备注: 每天一个归档目录, 既然有了归档, 计划将归档目录 nfs 共享给备库做恢复
归档目录1
2
3postgres@db--> ls /pg_arch/arch/
2014-06-30 2014-07-02 2014-07-04 2014-07-06 2014-07-08 2014-07-10 2014-07-12 2014-07-14 2014-07-16
2014-07-01 2014-07-03 2014-07-05 2014-07-07 2014-07-09 2014-07-11 2014-07-13 2014-07-15
配置 NFS
修改主库 /etc/exports1
/pg_arch 192.168.xxx.xxx/32(rw,no_root_squash,sync)
重启 nfs 服务1
2
3
4
5
6
7
8
9
10
11[root@db- arch]# service nfs restart
Shutting down NFS daemon: [OK ]
Shutting down NFS mountd: [OK ]
Shutting down NFS quotas: [OK ]
Shutting down NFS services: [OK ]
Shutting down RPC idmapd: [OK ]
Starting NFS services: [OK ]
Starting NFS quotas: [OK ]
Starting NFS mountd: [OK ]
Starting NFS daemon: [OK ]
[[AStarting RPC idmapd: [OK ]
备库 mount 归档目录1
2[root@db2 ~]
[root@db2 ~]
备注: 我们将归档目录 mount 到 /pg_restore 目录, nfs 其它配置这里不记录了.请查看 nfs 相关文档,接下来可以进行还原操作了.
修改备库配置
修改备库 recovery.conf 文件的以下参数1
restore_command = 'cp /pg_restore/arch/2014-07-[1-3][0-9]/%f %p;'
备注: 因为要取多天归档, 所以用了正则.
修改主库配置
操作前, 调整主库参数1
wal_keep_segments = 20000
备注: 之前设置得太小了,调整后执行 reload 操作。
重启备库
1 | pg_ctl restart -m fast -D $PGDATA |
备注: 接下来进行备库恢复操作, 重启备库。
查看备库日志1
2
3
4
5
6
7
8
9
10
11
12
13
14
152014-07-16 10:14:54.583 CST,,,1305,,53c5e01e.519,1,,2014-07-16 10:14:54 CST,,0,LOG,00000,"database system was shut down in recovery at 2014-07-16 10:14:42 CST",,,,,,,,"StartupXLOG, xlog.c:6273",""
2014-07-16 10:15:09.231 CST,,,1619,"",53c5e02d.653,1,"",2014-07-16 10:15:09 CST,,0,LOG,00000,"connection received: host=192.168.100.3 port=50518",,,,,,,,"BackendInitialize, postmaster.c:3471",""
2014-07-16 10:15:09.234 CST,,,1619,"192.168.100.3:50518",53c5e02d.653,2,"",2014-07-16 10:15:09 CST,,0,LOG,08P01,"incomplete startup packet",,,,,,,,"ProcessStartupPacket, postmaster.c:1505",""
2014-07-16 10:15:50.825 CST,,,2241,"",53c5e056.8c1,1,"",2014-07-16 10:15:50 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=11732",,,,,,,,"BackendInitialize, postmaster.c:3471",""
2014-07-16 10:15:50.825 CST,"postgres","postgres",2241,"127.0.0.1:11732",53c5e056.8c1,2,"",2014-07-16 10:15:50 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,"ProcessStartupPacket, postmaster.c:1759",""
2014-07-16 10:15:52.906 CST,,,2258,"",53c5e058.8d2,1,"",2014-07-16 10:15:52 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=11734",,,,,,,,"BackendInitialize, postmaster.c:3471",""
....
...
2014-07-16 10:23:15.393 CST,,,1305,,53c5e01e.519,11,,2014-07-16 10:14:54 CST,1/0,0,LOG,00000,"restored log file ""00000006000060EB0000000C"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3273",""
2014-07-16 10:23:18.318 CST,,,1305,,53c5e01e.519,12,,2014-07-16 10:14:54 CST,1/0,0,LOG,00000,"restored log file ""00000006000060EB0000000D"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3273",""
2014-07-16 10:24:17.714 CST,,,1305,,53c5e01e.519,13,,2014-07-16 10:14:54 CST,1/0,0,LOG,00000,"restored log file ""00000006000060EB0000000E"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3273",""
2014-07-16 10:24:20.552 CST,,,1305,,53c5e01e.519,14,,2014-07-16 10:14:54 CST,1/0,0,LOG,00000,"restored log file ""00000006000060EB0000000F"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3273",""
....
省略部分日志
备注: 可以看到, 备库在拼命地读归档日志并 apply, 库比较大, WAL 也很多, 共有两天的 WAL 日志需要应用 , 估计没个一天半载跑不完。( 后来花了一天半才恢复!)
后续措施
- 调高 wal_keep_segments 参数
- 加大监控力度