
PostgreSQL: Deceiving the Optimizer with Extended Statistics

Author: Internet

What are extended statistics

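An extended statistics object tracks data about a specified table, foreign table, or materialized view. The kinds currently supported in PostgreSQL are n-distinct counts (ndistinct), functional dependencies (dependencies) and, from PostgreSQL 12, most-common-value lists (mcv).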

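This article discusses only n-distinct statistics and their role in the optimizer: by modifying the statistics by hand, the execution plan can be made to change.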

Data preparation

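Create a large table that simulates the transaction detail records of a business. The table holds a large volume of data and also carries a large number of dimension columns.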

create table t_order as
select id,
       'dim01_' || (random() * 5)::int                 as dim01,
       'dim02_' || (random() * 5)::int                 as dim02,
       'dim03_' || (random() * 5)::int                 as dim03,
       'dim04_' || (random() * 5)::int                 as dim04,
       'dim05_' || (random() * 5)::int                 as dim05,
       'dim06_' || (random() * 5)::int                 as dim06,
       'dim07_' || (random() * 5)::int                 as dim07,
       'dim08_' || (random() * 5)::int                 as dim08,
       'dim09_' || (random() * 5)::int                 as dim09,
       'dim10_' || (random() * 5)::int                 as dim10,
       'dim11_' || (random() * 5)::int                 as dim11,
       'dim12_' || (random() * 5)::int                 as dim12,
       'dim13_' || (random() * 5)::int                 as dim13,
       'dim14_' || (random() * 5)::int                 as dim14,
       'dim15_' || (random() * 5)::int                 as dim15,
       'dim16_' || (random() * 5)::int                 as dim16,
       'dim17_' || (random() * 5)::int                 as dim17,
       'dim18_' || (random() * 5)::int                 as dim18,
       'dim19_' || (random() * 5)::int                 as dim19,
       'dim20_' || (random() * 5)::int                 as dim20,
       'dim21_' || (random() * 5)::int                 as dim21,
       'dim22_' || (random() * 5)::int                 as dim22,
       'dim23_' || (random() * 5)::int                 as dim23,
       'dim24_' || (random() * 5)::int                 as dim24,
       'dim25_' || (random() * 5)::int                 as dim25,
       'dim26_' || (random() * 5)::int                 as dim26,
       'dim27_' || (random() * 5)::int                 as dim27,
       'dim28_' || (random() * 5)::int                 as dim28,
       'dim29_' || (random() * 5)::int                 as dim29,
       'dim30_' || (random() * 5)::int                 as dim30,
       (random() * 100)::numeric(20, 2)                as amount,
       (now() - random() * interval '10 days')::date   as created
from (select generate_series(1, 10000000) id) t;

10000000 rows affected in 1 m 8 s 747 ms

select pg_table_size('t_order')/1024/1024;
 ?column?
----------
     2893
(1 row)

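The test data has 10 million rows and 30 dimensions, and occupies 2893 MB. In a real business system an order table can reach 10 TB, with daily increments of up to 100 GB, far larger than the test data.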

Query requirements

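In a reporting system, the transaction detail data must additionally be rolled up into a summary table covering all dimensions. The summary table holds data aggregated over combinations of each dimension's base granularity. Individual reports then pick one or a few dimensions from that result, read them at their fine granularity, and aggregate again. If the dimensions have few distinct values, the huge volume of transaction detail is ultimately compressed into a small number of aggregated rows. In the query plan, this aggregation is carried out by one of two plan nodes: HashAggregate or GroupAggregate.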

select dim01, count(*) as cnt, sum(amount) as amount, ....
from t_order
group by dim01, .....

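To make it easier to show which statements the optimizer runs with HashAggregate and which with GroupAggregate, the test sets the memory parameter work_mem to a small value (1024 kB).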

set work_mem = 1024;

Test steps

Before analyzing the table

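Because the table has no statistics yet, each grouping column is estimated to have 200 distinct values. The estimated number of groups is therefore 200 to the power n for n grouping columns, or the total row count, whichever is smaller.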

Grouping by a single column, the plan uses HashAggregate.

explain 
select count(*) , sum(amount) ,count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount),count(*) , sum(amount)
from t_order
group by dim01;

HashAggregate  (cost=1245372.49..1245381.99 rows=200 width=632)
  Group Key: dim01
  ->  Seq Scan on t_order  (cost=0.00..470371.17 rows=10000017 width=52)

Grouping by multiple columns, the plan uses GroupAggregate.

explain
select  count(*) , sum(amount)
from t_order
group by dim01,dim02 ;

GroupAggregate  (cost=3068597.60..3194097.81 rows=40000 width=104)
  Group Key: dim01, dim02
  ->  Sort  (cost=3068597.60..3093597.64 rows=10000017 width=84)
        Sort Key: dim01, dim02
        ->  Seq Scan on t_order  (cost=0.00..470371.17 rows=10000017 width=84)
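
The two row estimates above (200 and 40000) follow from the rule stated for a table without statistics. The sketch below is a simplified model in Python, not the optimizer's actual code; the function name estimate_groups_without_stats is hypothetical, and the figure of 200 distinct values per column is the one given in the text.

```python
DEFAULT_NDISTINCT = 200  # per-column guess used when no statistics exist (from the text)


def estimate_groups_without_stats(n_group_columns: int, total_rows: int) -> int:
    """Simplified model: 200^n, capped at the table's estimated row count."""
    return min(DEFAULT_NDISTINCT ** n_group_columns, total_rows)


TOTAL_ROWS = 10_000_017  # row estimate shown in the Seq Scan nodes

print(estimate_groups_without_stats(1, TOTAL_ROWS))  # 200, as in the single-column plan
print(estimate_groups_without_stats(2, TOTAL_ROWS))  # 40000, as in the two-column plan
print(estimate_groups_without_stats(4, TOTAL_ROWS))  # 10000017, capped at the row count
```

With four or more grouping columns the product already exceeds the table size, so in this model the estimate stays at the row count.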


After analyzing the table

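Now that the table's statistics are known, the estimated number of groups is the product of the distinct-value counts of the grouping columns, or the total row count, whichever is smaller.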


explain
select  count(*), sum(amount), count(*), sum(amount), count(*) 
from t_order
group by dim01, dim02, dim03, dim04 , dim05 ;

HashAggregate  (cost=720371.59..720488.23 rows=7776 width=128)
  Group Key: dim01, dim02, dim03, dim04, dim05
  ->  Seq Scan on t_order  (cost=0.00..470371.17 rows=10000017 width=46)

-- add more aggregate columns
explain
select  count(*), sum(amount), count(*), sum(amount), count(*), sum(amount)
from t_order
group by dim01, dim02, dim03, dim04 , dim05 ;

GroupAggregate  (cost=2248285.10..2548421.69 rows=7776 width=160)
  Group Key: dim01, dim02, dim03, dim04, dim05
  ->  Sort  (cost=2248285.10..2273285.14 rows=10000017 width=46)
        Sort Key: dim01, dim02, dim03, dim04, dim05
        ->  Seq Scan on t_order  (cost=0.00..470371.17 rows=10000017 width=46)

-- add one more grouping column
explain
select  count(*)
from t_order
group by dim01, dim02, dim03, dim04 , dim05, dim06 ;

GroupAggregate  (cost=2248285.10..2448752.00 rows=46656 width=56)
  Group Key: dim01, dim02, dim03, dim04, dim05, dim06
  ->  Sort  (cost=2248285.10..2273285.14 rows=10000017 width=48)
        Sort Key: dim01, dim02, dim03, dim04, dim05, dim06
        ->  Seq Scan on t_order  (cost=0.00..470371.17 rows=10000017 width=48)
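
The estimates 7776 and 46656 in the plans above can be reproduced by hand. The expression (random() * 5)::int rounds to the integers 0 through 5, so each dimension column holds six distinct values, and the product rule gives 6^5 and 6^6. The Python sketch below is a simplified model of that rule only; estimate_groups is a hypothetical name, and the real optimizer applies further adjustments that are not modelled here.

```python
from math import prod


def estimate_groups(ndistinct_per_column, total_rows):
    """Simplified model: product of per-column distinct counts, capped at the row count."""
    return min(prod(ndistinct_per_column), total_rows)


TOTAL_ROWS = 10_000_017  # row estimate shown in the Seq Scan nodes
VALUES_PER_DIM = 6       # (random() * 5)::int yields 0..5

for n_columns in (5, 6, 7, 9):
    print(n_columns, estimate_groups([VALUES_PER_DIM] * n_columns, TOTAL_ROWS))
# 5 7776
# 6 46656
# 7 279936
# 9 10000017
```

Every added column multiplies the estimate by six, and from nine columns on the product exceeds the table size and is capped at the row count.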

Analysis with DISTINCT

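The DISTINCT clause behaves the same way. For a query that only selects the distinct values of several dimension columns, the estimated row count likewise determines which plan is chosen.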

explain analyse
select distinct dim01 , dim02 , dim03 , dim04 , dim05 ,dim06 
from t_order;

HashAggregate  (cost=620371.43..620837.99 rows=46656 width=48) (actual time=4422.376..4427.546 rows=46656 loops=1)
  Group Key: dim01, dim02, dim03, dim04, dim05, dim06
  ->  Seq Scan on t_order  (cost=0.00..470371.17 rows=10000017 width=48) (actual time=0.013..710.242 rows=10000000 loops=1)
Planning Time: 0.081 ms
Execution Time: 4428.778 ms

explain analyse
select distinct dim01 , dim02 , dim03 , dim04 , dim05 ,dim06 , dim07
from t_order;

Unique  (cost=2316647.10..2516647.44 rows=279936 width=56) (actual time=64027.276..74826.618 rows=279456 loops=1)
  ->  Sort  (cost=2316647.10..2341647.14 rows=10000017 width=56) (actual time=64027.274..72741.372 rows=10000000 loops=1)
        Sort Key: dim01, dim02, dim03, dim04, dim05, dim06, dim07
        Sort Method: external merge  Disk: 645840kB
        ->  Seq Scan on t_order  (cost=0.00..470371.17 rows=10000017 width=56) (actual time=0.014..1469.598 rows=10000000 loops=1)
Planning Time: 0.080 ms
Execution Time: 74872.438 ms


Interim analysis

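In the optimizer's cost formula, once the combined estimate passes a threshold at which work_mem can no longer hold the memory that HashAggregate needs, the optimizer chooses GroupAggregate instead. GroupAggregate sorts first and then aggregates, which costs more CPU time. In the seven-column DISTINCT test above, the sort also spilled 645840 kB to disk, and the query took about 75 s, against about 4.4 s for the six-column HashAggregate.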

Extended statistics - multi-column statistics

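Creating an extended statistics object gives the optimizer a count of distinct value combinations across several columns, gathered from the table itself, in place of the multiplied per-column estimates. With it, the optimizer chooses the better-performing HashAggregate.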

create statistics t_order_01 (ndistinct ) on dim01 , dim02 , dim03, dim04 , dim05 , dim06 from t_order;
analyse t_order;

explain analyse
select count(*), sum(amount)
from t_order
group by dim01 , dim02 , dim03, dim04 , dim05 , dim06;

 HashAggregate  (cost=670365.21..670792.91 rows=34216 width=88) (actual time=6810.690..6822.648 rows=46656 loops=1)
   Group Key: dim01, dim02, dim03, dim04, dim05, dim06
   ->  Seq Scan on t_order  (cost=0.00..470369.07 rows=9999807 width=54) (actual time=0.008..693.920 rows=10000000 loops=1)
 Planning Time: 0.332 ms
 Execution Time: 6824.820 ms

Extended statistics - viewing the information

 select * from pg_statistic_ext where stxname='t_order_01';
-[ RECORD 1 ]+----------------
oid          | 670835
stxrelid     | 670816
stxname      | t_order_01
stxnamespace | 18629
stxowner     | 16384
stxkeys      | 2 3 4 5 6 7
stxkind      | {d}

The stxkeys value lists the attribute numbers of the columns dim01, dim02, dim03, dim04, dim05 and dim06 (id is column 1, so dim01 is column 2).


kingbase=# select stxname,stxdndistinct from pg_statistic_ext_data , pg_statistic_ext where  stxoid  = oid and stxname='t_order_01'; 

stxname       | t_order_01
stxdndistinct | {"2, 3": 36, "2, 4": 36, ... "3, 4": 36, ... "6, 7": 36, "2, 3, 4": 216, "2, 3, 5": 216, ... "5, 6, 7": 216, "2, 3, 4, 5": 1296, ... "2, 3, 4, 5, 6, 7": 827260}

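The stxdndistinct value is the set of distinct counts for every combination of two or more of the n columns, that is, C(n,2) + C(n,3) + ... + C(n,n) = 2^n - n - 1 entries. If n is large, analyzing the table takes a very long time.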
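
To get a feel for how quickly the number of stored combinations grows, the count can be worked out directly. This is a small Python sketch; ndistinct_entries is a hypothetical helper name, and it only counts the column combinations of size two or more, it does not query the database.

```python
from math import comb


def ndistinct_entries(n_columns: int) -> int:
    """Number of column combinations of size 2..n, one ndistinct entry each."""
    return sum(comb(n_columns, k) for k in range(2, n_columns + 1))


for n in range(2, 9):
    # The closed form 2^n - n - 1 gives the same count.
    print(n, ndistinct_entries(n), 2 ** n - n - 1)
# 2 1 1
# 3 4 4
# 4 11 11
# 5 26 26
# 6 57 57
# 7 120 120
# 8 247 247
```

The six-column object t_order_01 therefore carries 57 entries, and each extra column roughly doubles the count.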

Extended statistics - limits

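An extended statistics object can cover at most 8 columns. The time needed to analyze the table grows at an accelerating rate with the number of columns.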


create statistics t_order_01 (ndistinct ) on dim01 , dim02 , dim03, dim04 , dim05 , dim06,dim07,dim08,dim09 from t_order;

ERROR:  cannot have more than 8 columns in statistics

ANALYZE time by number of columns

Columns   Time        Growth (%)
2           649 ms    -
3           702 ms      8
4           872 ms     24
5          1338 ms     53
6          2452 ms     83
7          5185 ms    111
8         11659 ms    125
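
The growth column in the timing table can be recomputed from the measured times. The Python sketch below uses only the figures from the table; growth_percent is a hypothetical helper name.

```python
# ANALYZE times from the table above, keyed by number of columns
timings_ms = {2: 649, 3: 702, 4: 872, 5: 1338, 6: 2452, 7: 5185, 8: 11659}


def growth_percent(timings):
    """Percentage increase of each timing over the previous column count."""
    cols = sorted(timings)
    return {
        cur: round((timings[cur] - timings[prev]) / timings[prev] * 100)
        for prev, cur in zip(cols, cols[1:])
    }


print(growth_percent(timings_ms))
# {3: 8, 4: 24, 5: 53, 6: 83, 7: 111, 8: 125}
```

From seven columns on, each additional column more than doubles the ANALYZE time.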

Extended statistics - going past the limit

If work_mem cannot be increased and there are more than 8 grouping columns, the following method is needed to get around the limit.

Extended statistics - extending the over-limit method

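If the number of grouping columns is very large, the columns can be split into several parts. Create a separate extended statistics object for each part, then produce the statistics values following the "over-limit method". This supports grouping and aggregation over any number of dimension columns, as far as memory allows.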
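
The splitting step can be sketched as a small generator of create statistics statements. This Python sketch only builds the SQL text. The helper names and the t_order_nd prefix are hypothetical, and it stays within the 8-column limit of create statistics, so for 30 columns it produces four objects rather than the three used in the run below. The step that fills in the statistics values is not covered.

```python
MAX_STAT_COLUMNS = 8  # limit reported by the error message in the text


def split_columns(columns, max_size=MAX_STAT_COLUMNS):
    """Split the column list into consecutive parts of at most max_size columns."""
    if len(columns) < 2:
        raise ValueError("a statistics object needs at least two columns")
    chunks = [list(columns[i:i + max_size]) for i in range(0, len(columns), max_size)]
    # A part with a single column is not valid; borrow one column from the previous part.
    if len(chunks) > 1 and len(chunks[-1]) == 1:
        chunks[-1].insert(0, chunks[-2].pop())
    return chunks


def create_statistics_sql(table, columns, prefix="t_order_nd"):
    """Build one 'create statistics' statement per part (object names are hypothetical)."""
    return [
        f"create statistics {prefix}_{idx:02d} (ndistinct) on {', '.join(chunk)} from {table};"
        for idx, chunk in enumerate(split_columns(columns), start=1)
    ]


dims = [f"dim{i:02d}" for i in range(1, 31)]
for stmt in create_statistics_sql("t_order", dims):
    print(stmt)
```

The generated statements would still need to be followed by analyse t_order, as in the six-column example earlier.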

-- three extended statistics objects were created on t_order, and their statistics values were updated with the "over-limit method"
explain (analyse,buffers )
select distinct dim01, dim02, dim03, dim04, dim05, dim06, dim07, dim08, dim09, dim10
, dim11, dim12, dim13, dim14, dim15, dim16, dim17, dim18, dim19, dim20
, dim21, dim22, dim23, dim24, dim25, dim26, dim17, dim28, dim29, dim30
from t_order;


 HashAggregate  (cost=1220372.45..1220372.46 rows=1 width=240) (actual time=12720.356..17644.089 rows=10000000 loops=1)
   Group Key: dim01, dim02, dim03, dim04, dim05, dim06, dim07, dim08, dim09, dim10, dim11, dim12, dim13, dim14, dim15, dim16, dim17, dim18, dim19, dim20, dim21, dim22, dim23, dim24, dim25, dim26, dim17, dim28, dim29, dim30
   Buffers: shared hit=370371
   ->  Seq Scan on t_order  (cost=0.00..470371.17 rows=10000017 width=240) (actual time=0.013..3613.716 rows=10000000 loops=1)
         Buffers: shared hit=370371
 Planning Time: 1.353 ms
 Execution Time: 17917.662 ms

Why this is feasible

Getting past the column limit of extended statistics

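This limit is enforced only when the object is created. Its purpose is to keep the analysis time from growing exponentially as the number of columns increases.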

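A suggested improvement is for the statement syntax to let the combinations be tuned so that fewer of them are computed, for example by using "()" to merge several columns into a single unit.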

Reusing statistics values

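Because the number of combinations over the statistics columns is so large, idle time can be used to precompute the commonly used combinations. The results are stored in a user table, and when a new table or partition is created they can be applied to it quickly in this way.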

Closing remarks

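The optimizer's job is to choose the best execution plan based on the results of its cost formulas. Those formulas need both single-column and multi-column statistics. When that information is missing, the optimizer falls back to a conservative plan, which hurts performance. Hopefully the optimizer will one day offer an aggressive mode and allow a query's execution plan to be pinned.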

Source: https://www.cnblogs.com/kingbase/p/16299956.html