PostgreSQL10:Multi-column Correlation Statistics

Multi-column Correlation Statistics

1
2
3
Add multi-column optimizer statistics to compute the correlation ratio and number of distinct values (Tomas  Vondra,  David  Rowley,  álvaro Herrera)

Real-world data frequently contains correlated data in table columns, which can easily fool the query planner into thinking WHERE clauses are more selective than they really are, which can cause some queries to become very slow. Multivariate statistics objects can be used to let the planner learn about this, which proofs it against making such mistakes. This manual section explains the feature in more detail, and this section shows some examples. This feature in PostgreSQL represents an advance in the state of the art for all SQL databases.

PostgreSQL10 增加多字段相关统计信息,听起来比较费解,通过以下例子演示下:

多字段相关性测试

创建测试表

1
2
3
4
5
6
7
8
9
10
11
12
13
francs=> create table test_t( a int4, b int4);
CREATE TABLE

francs=> insert into test_t(a,b) select n%100,n%100 from generate_series(1,10000) n;
INSERT 0 10000

francs=> select * from test_t limit 3;
a | b
---+---
1 | 1
2 | 2
3 | 3
(3 rows)

根据字段a查询

执行计划如下:

1
2
3
4
5
6
7
8
9
francs=> explain analyze select  *  from  test_t  where a=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on test_t (cost=0.00..148.00 rows=100 width=8) (actual time=0.017..0.743 rows=100 loops=1)
Filter: (a = 1)
Rows Removed by Filter: 9900
Planning time: 0.137 ms
Execution time: 0.765 ms
(5 rows)

备注: Seq Scan on test_t (cost=0.00..148.00 rows=100 width=8) 表示执行计划分析的信息,这时SQL还没执行,(actual time=0.017..0.743 rows=100 loops=1) 表示SQL实际执行时返回的信息, 可以看到SQL实际扫描的行数和执行计划一致,都为 100 rows。

根据字段b查询

执行计划如下:

1
2
3
4
5
6
7
8
9
francs=> explain analyze select  *  from  test_t  where b=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on test_t (cost=0.00..148.00 rows=100 width=8) (actual time=0.018..0.837 rows=100 loops=1)
Filter: (b = 1)
Rows Removed by Filter: 9900
Planning time: 0.087 ms
Execution time: 0.861 ms
(5 rows)

备注:可以看到SQL实际扫描的行数和执行计划一致,都为 100 rows。

根据 a=1 and b=1 查询

执行计划如下:

1
2
3
4
5
6
7
8
9
francs=> explain analyze select  *  from  test_t  where a=1  and b=1;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on test_t (cost=0.00..173.00 rows=1 width=8) (actual time=0.029..0.765 rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Planning time: 0.078 ms
Execution time: 0.790 ms
(5 rows)

备注:这里执行计划的 rows=1,而实际SQL返回 rows=100,执行计划认为 a 字段和 b 字段为互不相关独立的两个字段,因此将两个字段的选择性相乘,0.01*0.01=0.0001,所以执行计划中的 rows=1;在生产案例中这种场景会得到错误的执行计划,影响SQL性能, PG 10 可以应对这个问题,支持多字段的相关性统计信息,如下:

创建统计信息

CREATE STATISTICS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
francs=> CREATE STATISTICS stts_test_t ON a, b FROM test_t;
WARNING: unrecognized node type: 333
CREATE STATISTICS

francs=> analyze test_t;
ANALYZE

francs=> explain analyze select * from test_t where a=1 and b=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on test_t (cost=0.00..173.00 rows=100 width=8) (actual time=0.018..0.741 rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Planning time: 0.130 ms
Execution time: 0.765 ms
(5 rows)

备注:创建 a,b 字段的相关性后,执行计划中的 rows 值变成 100 了。

查询 STATISTICS

1
2
3
4
5
francs=>  select  *  from pg_statistic_ext where stxname='stts_test_t';
stxrelid | stxname | stxnamespace | stxowner | stxkeys | stxkind | stxndistinct| stxdependencies
----------+-------------+--------------+----------+---------+---------+---------------+------------------------------------------
24875 | stts_test_t | 16387 | 16384 | 1 2 | {d,f} | {"1, 2": 100} | {"1 => 2": 1.000000, "2 => 1": 1.000000}
(1 row)

  • stxname 指 statistics 名称;
  • stxkeys 指 statistics 对应的字段名称;
  • stxkind 指 statistic 的类型,d=n-distinct statistics ,f=functional dependency statistics;
  • stxndistinct 组合字段的 distinct 值。

注意事项

  1. 当SQL 语句where 中出现多个字段等于号后接常量的场景时(where a=? and b=? …),可以使用此特性;
  2. 手册上指出 Functional Dependencies 目前仅支持等于号后接常量的场景 ,其它场景暂不支持,比如等于号后接表达式、>=、<=、LIKE 等操作

参考

最后推荐和张文升共同编写的《PostgreSQL实战》,本书基于PostgreSQL 10 编写,共18章,重点介绍SQL高级特性、并行查询、分区表、物理复制、逻辑复制、备份恢复、高可用、性能优化、PostGIS等,涵盖大量实战用例!

购买链接:https://item.jd.com/12405774.html

PostgreSQL实战
感谢支持!
0%