Multi-column Correlation Statistics
1 | Add multi-column optimizer statistics to compute the correlation ratio and number of distinct values (Tomas Vondra, David Rowley, álvaro Herrera) |
PostgreSQL10 增加多字段相关统计信息,听起来比较费解,通过以下例子演示下:
多字段相关性测试
创建测试表
1 | francs=> create table test_t( a int4, b int4); |
根据字段a查询
执行计划如下:1
2
3
4
5
6
7
8
9francs=> explain analyze select * from test_t where a=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on test_t (cost=0.00..148.00 rows=100 width=8) (actual time=0.017..0.743 rows=100 loops=1)
Filter: (a = 1)
Rows Removed by Filter: 9900
Planning time: 0.137 ms
Execution time: 0.765 ms
(5 rows)
备注: Seq Scan on test_t (cost=0.00..148.00 rows=100 width=8) 表示执行计划分析的信息,这时SQL还没执行,(actual time=0.017..0.743 rows=100 loops=1) 表示SQL实际执行时返回的信息, 可以看到SQL实际扫描的行数和执行计划一致,都为 100 rows。
根据字段b查询
执行计划如下:1
2
3
4
5
6
7
8
9francs=> explain analyze select * from test_t where b=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on test_t (cost=0.00..148.00 rows=100 width=8) (actual time=0.018..0.837 rows=100 loops=1)
Filter: (b = 1)
Rows Removed by Filter: 9900
Planning time: 0.087 ms
Execution time: 0.861 ms
(5 rows)
备注:可以看到SQL实际扫描的行数和执行计划一致,都为 100 rows。
根据 a=1 and b=1 查询
执行计划如下:1
2
3
4
5
6
7
8
9francs=> explain analyze select * from test_t where a=1 and b=1;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on test_t (cost=0.00..173.00 rows=1 width=8) (actual time=0.029..0.765 rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Planning time: 0.078 ms
Execution time: 0.790 ms
(5 rows)
备注:这里执行计划的 rows=1,而实际SQL返回 rows=100,执行计划认为 a 字段和 b 字段为互不相关独立的两个字段,因此将两个字段的选择性相乘,0.01*0.01=0.0001,所以执行计划中的 rows=1;在生产案例中这种场景会得到错误的执行计划,影响SQL性能, PG 10 可以应对这个问题,支持多字段的相关性统计信息,如下:
创建统计信息
CREATE STATISTICS1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16francs=> CREATE STATISTICS stts_test_t ON a, b FROM test_t;
WARNING: unrecognized node type: 333
CREATE STATISTICS
francs=> analyze test_t;
ANALYZE
francs=> explain analyze select * from test_t where a=1 and b=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on test_t (cost=0.00..173.00 rows=100 width=8) (actual time=0.018..0.741 rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Planning time: 0.130 ms
Execution time: 0.765 ms
(5 rows)
备注:创建 a,b 字段的相关性后,执行计划中的 rows 值变成 100 了。
查询 STATISTICS1
2
3
4
5francs=> select * from pg_statistic_ext where stxname='stts_test_t';
stxrelid | stxname | stxnamespace | stxowner | stxkeys | stxkind | stxndistinct| stxdependencies
----------+-------------+--------------+----------+---------+---------+---------------+------------------------------------------
24875 | stts_test_t | 16387 | 16384 | 1 2 | {d,f} | {"1, 2": 100} | {"1 => 2": 1.000000, "2 => 1": 1.000000}
(1 row)
- stxname 指 statistics 名称;
- stxkeys 指 statistics 对应的字段名称;
- stxkind 指 statistic 的类型,d=n-distinct statistics ,f=functional dependency statistics;
- stxndistinct 组合字段的 distinct 值。
注意事项
- 当SQL 语句where 中出现多个字段等于号后接常量的场景时(where a=? and b=? …),可以使用此特性;
- 手册上指出 Functional Dependencies 目前仅支持等于号后接常量的场景 ,其它场景暂不支持,比如等于号后接表达式、>=、<=、LIKE 等操作