您现在的位置是：首页 > 数据库

当前栏目

PostgreSQL 多维分析 CASE

postgresql case

2023-09-14 09:02:06 时间

昨天和小米的童鞋交流，听说的一个痛点。
也是很多给企业做BI分析的开发小伙伴，可能经常会遇到这样的痛苦，运营人员今天问你要这样的维度报表，明天换个维度又来"折腾"你。
对于开发的小伙伴，确实是非常痛苦的事情，那么有什么好的应对策略，而且对运营来说可能会显得比较高逼格呢？

多维分析派上用场，比如你的表有10个字段，允许运营人员以任意字段组合，产生报表。
很多商业数据库都带了这个功能，开源数据库带这个功能的不多。PostgreSQL真是业界良心啊~~~

例子
假设有4个业务字段，一个时间字段。

postgres=# create table tab5(c1 int, c2 int, c3 int, c4 int, crt_time timestamp);

CREATE TABLE

生成一批测试数据

postgres=# insert into tab5 select 

trunc(100*random()), 

trunc(1000*random()), 

trunc(10000*random()), 

trunc(100000*random()), 

clock_timestamp() + (trunc(10000*random())|| hour)::interval 

from generate_series(1,1000000);

INSERT 0 1000000

postgres=# select * from tab5 limit 10;

 c1 | c2 | c3 | c4 | crt_time 

----+-----+------+-------+----------------------------

 72 | 46 | 3479 | 20075 | 2017-02-02 14:56:36.854218

 98 | 979 | 4491 | 83012 | 2017-06-13 08:56:36.854416

 54 | 758 | 5838 | 45956 | 2016-09-18 02:56:36.854427

 3 | 67 | 5148 | 74754 | 2017-01-01 01:56:36.854431

 42 | 650 | 7681 | 36495 | 2017-06-20 15:56:36.854435

 4 | 472 | 6454 | 19554 | 2016-06-18 19:56:36.854438

 82 | 922 | 902 | 17435 | 2016-07-21 14:56:36.854441

 68 | 156 | 1028 | 13275 | 2017-07-16 10:56:36.854444

 0 | 674 | 7446 | 59386 | 2016-07-26 09:56:36.854447

 0 | 629 | 2022 | 52285 | 2016-11-04 13:56:36.85445

(10 rows)

创建一个统计结果表, 其中bitmap表示统计的字段组合, 用位置符0,1表示是否统计了该维度

create table stat_tab5 (c1 int, c2 int, c3 int, c4 int, time1 text, time2 text, time3 text, time4 text, cnt int8, bitmap text);

生成业务字段任意维度组合+4组时间任选一组的组合统计
PS (如果业务字段有空的情况，建议统计时用coalesce转一下，确保不会统计到空的情况)

insert into stat_tab5

select c1,c2,c3,c4,t1,t2,t3,t4,cnt, 

 || 

case when c1 is null then 0 else 1 end || 

case when c2 is null then 0 else 1 end || 

case when c3 is null then 0 else 1 end || 

case when c4 is null then 0 else 1 end || 

case when t1 is null then 0 else 1 end || 

case when t2 is null then 0 else 1 end || 

case when t3 is null then 0 else 1 end || 

case when t4 is null then 0 else 1 end

from 

select c1,c2,c3,c4,

to_char(crt_time, yyyy) t1, 

to_char(crt_time, yyyy-mm) t2, 

to_char(crt_time, yyyy-mm-dd) t3, 

to_char(crt_time, yyyy-mm-dd hh24) t4, 

count(*) cnt

from tab5 

group by 

cube(c1,c2,c3,c4), 

grouping sets(to_char(crt_time, yyyy), to_char(crt_time, yyyy-mm), to_char(crt_time, yyyy-mm-dd), to_char(crt_time, yyyy-mm-dd hh24))

INSERT 0 49570486

Time: 172373.714 ms

在bitmap上创建索引方便取数据

create index idx_stat_tab5_bitmap on stat_tab5 (bitmap);

用户勾选几个维度，取出数据

c1,c3,c4,t3 = bitmap(10110010)

postgres=# select c1,c3,c4,time3,cnt from stat_tab5 where bitmap=10110010 limit 10;

 c1 | c3 | c4 | time3 | cnt 

----+----+-------+------------+-----

 41 | 0 | 30748 | 2016-06-04 | 1

 69 | 0 | 87786 | 2016-06-04 | 1

 70 | 0 | 38805 | 2016-06-04 | 1

 79 | 0 | 65892 | 2016-06-08 | 1

 51 | 0 | 13615 | 2016-06-11 | 1

 47 | 0 | 42196 | 2016-06-28 | 1

 45 | 0 | 54736 | 2016-07-01 | 1

 50 | 0 | 21605 | 2016-07-02 | 1

 46 | 0 | 40888 | 2016-07-16 | 1

 41 | 0 | 90258 | 2016-07-17 | 1

(10 rows)

Time: 0.528 ms

postgres=# select * from stat_tab5 where bitmap=00001000 limit 10;

 c1 | c2 | c3 | c4 | time1 | time2 | time3 | time4 | cnt | bitmap 

----+----+----+----+-------+-------+-------+-------+--------+----------

 | | | | 2016 | | | | 514580 | 00001000

 | | | | 2017 | | | | 485420 | 00001000

(2 rows)

Time: 0.542 ms

执行计划，可以看到优雅的sort，一次sort多次rolldown, 不是简单的union all哦。

 QUERY PLAN 

------------------------------------------------------------------------------------------------------------------

 Insert on stat_tab5 (cost=208059.84..142986926.23 rows=1536000000 width=184)

 - Subquery Scan on t (cost=208059.84..142986926.23 rows=1536000000 width=184)

 - GroupAggregate (cost=208059.84..35466926.23 rows=1536000000 width=152)

 Group Key: (to_char(tab5.crt_time, yyyy::text)), tab5.c4, tab5.c2, tab5.c1

 Group Key: (to_char(tab5.crt_time, yyyy::text)), tab5.c4, tab5.c2

 Group Key: (to_char(tab5.crt_time, yyyy::text)), tab5.c4

 Group Key: (to_char(tab5.crt_time, yyyy::text))

 Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text))

 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text))

 Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text))

 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text))

 Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text))

 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text))

 Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, yyyy::text))

 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, yyyy::text))

 Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c1

 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c1

 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text))

 Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c1

 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c1

 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text))

 Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c1

 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c1

 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text))

 Sort Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, yyyy::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, yyyy::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, yyyy::text))

 Sort Key: tab5.c2, (to_char(tab5.crt_time, yyyy::text)), tab5.c3, tab5.c4

 Group Key: tab5.c2, (to_char(tab5.crt_time, yyyy::text)), tab5.c3, tab5.c4

 Group Key: tab5.c2, (to_char(tab5.crt_time, yyyy::text)), tab5.c3

 Group Key: tab5.c2, (to_char(tab5.crt_time, yyyy::text))

 Sort Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c2

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text))

 Sort Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c2

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm-dd::text))

 Sort Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c1, tab5.c4

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c1, tab5.c4

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c1

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text))

 Sort Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c2

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy-mm::text))

 Sort Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c1, tab5.c4

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c1, tab5.c4

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c1

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm-dd::text))

 Sort Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c1, tab5.c4

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c1, tab5.c4

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c1

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy-mm::text))

 Sort Key: tab5.c3, (to_char(tab5.crt_time, yyyy::text)), tab5.c1, tab5.c4

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy::text)), tab5.c1, tab5.c4

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy::text)), tab5.c1

 Group Key: tab5.c3, (to_char(tab5.crt_time, yyyy::text))

 Sort Key: tab5.c1, (to_char(tab5.crt_time, yyyy::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy::text)), tab5.c2, tab5.c3

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy::text)), tab5.c2

 Group Key: tab5.c1, (to_char(tab5.crt_time, yyyy::text))

 Sort Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c1, tab5.c2, tab5.c3

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c1, tab5.c2, tab5.c3

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c1

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm::text))

 Sort Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c1, tab5.c2, tab5.c3

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c1, tab5.c2, tab5.c3

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c1

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd::text))

 Sort Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c1, tab5.c2, tab5.c3

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c1, tab5.c2, tab5.c3

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c1

 Group Key: tab5.c4, (to_char(tab5.crt_time, yyyy-mm-dd hh24::text))

 Sort Key: (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c2, tab5.c3, tab5.c4

 Group Key: (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c2, tab5.c3, tab5.c4

 Group Key: (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c2, tab5.c3

 Group Key: (to_char(tab5.crt_time, yyyy-mm::text)), tab5.c2

 Group Key: (to_char(tab5.crt_time, yyyy-mm::text))

 Sort Key: (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c2, tab5.c3, tab5.c4

 Group Key: (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c2, tab5.c3, tab5.c4

 Group Key: (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c2, tab5.c3

 Group Key: (to_char(tab5.crt_time, yyyy-mm-dd hh24::text)), tab5.c2

 Group Key: (to_char(tab5.crt_time, yyyy-mm-dd hh24::text))

 Sort Key: (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c2, tab5.c3, tab5.c4

 Group Key: (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c2, tab5.c3, tab5.c4

 Group Key: (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c2, tab5.c3

 Group Key: (to_char(tab5.crt_time, yyyy-mm-dd::text)), tab5.c2

 Group Key: (to_char(tab5.crt_time, yyyy-mm-dd::text))

 - Sort (cost=208059.84..210559.84 rows=1000000 width=144)

 Sort Key: (to_char(tab5.crt_time, yyyy::text)), tab5.c4, tab5.c2, tab5.c1

 - Seq Scan on tab5 (cost=0.00..26370.00 rows=1000000 width=144)

(93 rows)

后续的优化手段
.1. 分区表
因为统计维度多，所以统计结果是非常庞大的。
数据分区可以帮组用户解决查询效率的问题。
例如按bitmap分区之后，在每个分区表再按时间维度分区。

.2. 流式计算
使用pipelinedb结合cube和grouping sets，把以上的统计改成流式统计，可以提升用户体验，快速得到报表。

.3. 使用MPP产品来提升数据的存储量和计算能力，例如Greenplum。

.4. 使用PostgreSQL 9.6 的CPU并行技术，性能可以有明显的提升。

.5. 使用GPU加速，也可以有非常巨大的提升。

.6. 更大数据量可以使用Greenplum，支持MPP。实现PB级别多维分析。

参考
https://www.postgresql.org/docs/9.6/static/queries-table-expressions.html#QUERIES-GROUPING-SETS

无序UUID会带来很多问题，例如索引分裂膨胀，离散IO，WAL膨胀等，详见以前的分析。 Regular random UUIDs are distributed uniformly over the whole range of possible values.
某个系统存储了会员的标签，以及标签的描述信息。业务上需要通过会员ID得到会员的标签，再得到描述信息。每个会员有若干标签，原来是这么存储的 1、会员标签表，人数5亿左右，每个人平均有几百个标签，1500亿行左右。
SQL UNION结构必须使可能不相似的类型匹配成为一个单一的结果集。该决定算法被独立地应用到一个联合查询的每个输出列。 INTERSECT和EXCEPT采用和UNION相同的方法来决定不相似的类型。
PostgreSQL 如何高效解决按任意字段分词检索的问题 - case 1 在有些应用场景中，可能会涉及多个字段的匹配。例如这样的场景，一张表包含了几个字段，分别为歌手，曲目，专辑名称，作曲，歌词，。。。用户可能要在所有的字段中以分词的方式匹配刘德华，任意字段匹配即返回TRUE。传统的做法是每个字段建立分词索引，然后挨个匹配。

猜你喜欢

HashMap 源码解析-番外（为何1.7hash会死循环）
学习MySQL学习之路——尚硅谷之歌（尚硅谷mysql）
Redis获取当前连接数一个实例探究（Redis获取当前连接数）
Linux下添加头文件路径的方法（linux添加头文件路径）
从草原来的虚拟人「奶思」火了，虚拟代言人的未来有多可期？
让Idea帮你登录Oracle数据库（idea登录oracle）
CSS制作一个半透明边框
Oracle中IN操作符的用法（oracle中in的用法）
Redis内存碎片和Pipeline管道
通过正则表达式删除空行的方法
自建直播间无从下手？腾讯云Lighthouse给你新选择！
编程Linux分段编程：一种更高效的技术（linux分段）
Dom与浏览器兼容性说明
socket到底是什么？
Zencart模板结构和设计详解[通俗易懂]
从Oracle入库到瓶颈怎样解决（oracle 入库瓶颈）

相关主题

如何用postgresql
postgresql是什么
postgresql开源吗
安装PostgreSql
postgresql表如何去重
在Linux下安装PostgreSQL
PostgreSQL 创建数据库
Oracle转PostgreSQL

zl程序教程

当前栏目

PostgreSQL 多维分析 CASE

相关文章