PostgreSQL:FATAL: requested WAL segment 0000000800002A0000000000 has already been removed

昨天,一重要生产环境的备库主机由于硬件故障需要停机做硬件检测,由于是流复制环境,备库可以停,停机检测大概花了 2 小时左右,之后再次启动备库时,报了如下错误:

数据库日志

1
2
3
2013-07-01 13:25:29.430 CST,,,27738,,51d112c8.6c5a,1,,2013-07-01 13:25:28 CST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,"libpqrcv_connect, libpqwalreceiver.c:171",""
2013-07-01 13:25:29.430 CST,,,27738,,51d112c8.6c5a,2,,2013-07-01 13:25:28 CST,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 0000000800002A0000000000 has already been removed
",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:389",""

备注:根据报错信息,很容易知道是由于停机时间的过程中备库所需的 WAL 已经被主库循环使用覆盖了,而在备库停机维护过程中,主库并未打开归档,这时这个备库需要重做了。也许有人会问,为何不一直打开主库的归档,我想说的是,这个库在 TB 级而且比较繁忙,忙的时候一天的归档 600 GB左右,这么大的归档需要大量的存储。但是在备库停机维护过程中,建议主库打开归档,只要不把归档目录撑满,那么在备库重新恢复后,有了主库的归档,那么备库依然能够跟上主库,为了加深理解,下面模拟这个错误,并演示规避方法:

环境信息

主机: 笔记本虚拟机
系统: Red Hat Enterprise Linux Server release 6.2
版本: PostgreSQL 9.3beta1
主库IP: 192.168.1.36 主机名:redhatB
备库IP: 192.168.1.35 主机名: redhat6
备注: 流复制搭建过程略,参考 PostgreSQL:使用 pg_basebackup 搭建流复制环境

模拟过程

2.1 设置主库 postgresql.conf

为了容易出演示效果,设置以下参数,其它参数根据需求设置:

1
2
3
4
5
6
checkpoint_segments = 3
archive_mode = on
archive_command = 'cp %p /archive/pg93/%f'
max_wal_senders = 3
wal_keep_segments = 3
max_wal_senders = 3

备注:归档目录 /archive/pg93/ 需要创建好并赋相应权限。

2.2 重载配置文件

1
2
[pg93@redhatB pg_root]$ pg_ctl reload -D $PGDATA
server signaled

备注:参数修改好后需要 reload 。

2.3 测试前: 主库数据

1
2
3
4
5
6
7
8
9
10
11
[pg93@redhatB ~]$ psql
psql (9.3beta1)
Type "help" for help.

postgres=# select * from test_1;
id | create_time
----+---------------------
1 | 2013-07-01 21:15:34
2 | 2013-07-01 21:55:37
3 | 2013-07-01 22:01:18
(3 rows)

2.4 测试前: 备库数据

1
2
3
4
5
6
7
8
9
10
[pg93@redhat6 ~]$ psql
psql (9.3beta1)
Type "help" for help.
postgres=# select * from test_1;
id | create_time
----+---------------------
1 | 2013-07-01 21:15:34
2 | 2013-07-01 21:55:37
3 | 2013-07-01 22:01:18
(3 rows)

备注:这里先标记下数据,以便后面做对比。

2.5 停备库

1
2
3
[pg93@redhat6 ~]$ pg_ctl stop -m fast -D $PGDATA
waiting for server to shut down.... done
server stopped

2.6 在主库上执行以下操作

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
postgres=# insert into test_1 values (5,now());
INSERT 0 1
postgres=# select pg_switch_xlog();
pg_switch_xlog
----------------
1/310000AC
(1 row)
postgres=# insert into test_1 values (5,now());
INSERT 0 1
postgres=# select pg_switch_xlog();
pg_switch_xlog
----------------
1/320004D0
(1 row)
postgres=# insert into test_1 values (5,now());
INSERT 0 1
postgres=# select pg_switch_xlog();
pg_switch_xlog
----------------
1/330000AC
(1 row)
.....

便于显示,重复内容就不贴了。

2.7 查看归档目录

1
2
[pg93@redhatB pg93]$ ll /archive/pg93/ | wc -l
36

备注:这时归档目录就产生了一些归档的 WAL 日志。

2.8 启动备库

1
2
[pg93@redhat6 ~]$ pg_ctl start -D $PGDATA
server starting

备注:数据库能启来,但查看日志,报以下错误:

2.9 csv 日志

1
2
3
4
[pg93@redhat6 pg_log]$ tail -f postgresql-2013-07-02_062642.csv
",,,,,,,,,""
2013-07-02 06:27:17.672 CST,,,3704,,51d20245.e78,1,,2013-07-02 06:27:17 CST,,0,LOG,00000,"started streaming WAL from primary at 1/22000000 on timeline 1",,,,,,,,,""
2013-07-02 06:27:17.674 CST,,,3704,,51d20245.e78,2,,2013-07-02 06:27:17 CST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000022 has already been removed

备注:目标的错误已重现,在主库的 $PGDATA/pg_xlog 目录里已找不到 000000010000000100000022 文件了,因为 XLOG 文件已被循环使用覆盖了,但在归档目录 /archive/pg93 里可以找到。

2.10 复制主库归档目录的 WAL 到备节点

1
[pg93@redhatB pg93]$ scp /archive/pg93/* pg93@192.168.1.35:/archive/pg93

备注:这时把主节点上归档目录的 WAL 文件复制到备节点的归档目录。

2.11 修改备库 recovery.conf 文件的以下参数

1
restore_command = 'cp /archive/pg93/%f %p'

备注:其它参数略, 参考 PostgreSQL:使用 pg_basebackup 搭建流复制环境

2.12 重启备库

1
2
3
4
5
[pg93@redhat6 pg_root]$ pg_ctl stop -m fast -D $PGDATA
waiting for server to shut down.... done
server stopped
[pg93@redhat6 pg_root]$ pg_ctl start -D $PGDATA
server starting

2.13 查看备库日志

1
2
3
4
5
6
7
8
9
10
11
12
[pg93@redhat6 pg_log]$ tail -f postgresql-2013-07-02_063606.csv
2013-07-02 06:36:08.123 CST,,,4008,,51d20456.fa8,4,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"redo starts at 1/200002BC",,,,,,,,,""
2013-07-02 06:36:08.592 CST,,,4008,,51d20456.fa8,5,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000021"" from archive",,,,,,,,,""
2013-07-02 06:36:08.618 CST,,,4008,,51d20456.fa8,6,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"consistent recovery state reached at 1/22000000",,,,,,,,,""
2013-07-02 06:36:08.630 CST,,,4006,,51d20456.fa6,1,,2013-07-02 06:36:06 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""
2013-07-02 06:36:10.014 CST,,,4008,,51d20456.fa8,7,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000022"" from archive",,,,,,,,,""
2013-07-02 06:36:10.800 CST,,,4008,,51d20456.fa8,8,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000023"" from archive",,,,,,,,,""
2013-07-02 06:36:12.139 CST,,,4008,,51d20456.fa8,9,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000024"" from archive",,,,,,,,,""
2013-07-02 06:36:13.937 CST,,,4008,,51d20456.fa8,10,,2013-07-02 06:36:06 CST,1/0,0,LOG,00000,"restored log file ""000000010000000100000025"" from archive",,,,,,,,,""
....
省略部分内容
2013-07-02 06:36:47.400 CST,,,4059,,51d2047f.fdb,1,,2013-07-02 06:36:47 CST,,0,LOG,00000,"started streaming WAL from primary at 1/43000000 on timeline 1",,,,,,,,,""

备注:此时备库从归档目录取到所需的 WAL 后开始拼命追主库,直到看到上面日志的最后一行信息时表示已完全追上主库。

验证

3.1 主库

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
[pg93@redhatB ~]$ psql
psql (9.3beta1)
Type "help" for help.
postgres=# select * from test_1 order by create_time desc limit 3;
id | create_time
----+---------------------
5 | 2013-07-02 06:38:07
5 | 2013-07-02 06:26:11
5 | 2013-07-02 06:26:09
(3 rows)
postgres=# select count(*) from test_1;
count
-------
39
(1 row)

3.2 备库

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
[pg93@redhat6 pg_root]$ psql 
psql (9.3beta1)
Type "help" for help.
postgres=# select * from test_1 order by create_time desc limit 3;
id | create_time
----+---------------------
5 | 2013-07-02 06:38:07
5 | 2013-07-02 06:26:11
5 | 2013-07-02 06:26:09
(3 rows)
postgres=# select count(*) from test_1;
count
-------
39
(1 row)

备注:此时在备库停掉过程中新增的数据也已经同步到备库了,恢复成功。

总结

  1. 对于比较繁忙的库,建议给 pg_xlog 分配较大的存储,从而能保留较多的 WAL 文件,在备节点
    停机维护后,能够获得更多的停库时间。

  2. 对于更繁忙的大数据库,例如 TB 级,如果没有足够的存储长期开启归档,那么至少在备节点需要
    停库维护时的这段时间把归档开启,否则当备库启来后追不上主库时,TB 级库重做 standby 的代
    价是可想而知。

    参考

最后推荐和张文升共同编写的《PostgreSQL实战》,本书基于PostgreSQL 10 编写,共18章,重点介绍SQL高级特性、并行查询、分区表、物理复制、逻辑复制、备份恢复、高可用、性能优化、PostGIS等,涵盖大量实战用例!

购买链接:https://item.jd.com/12405774.html

PostgreSQL实战
感谢支持!
0%