zl程序教程

您现在的位置是:首页 >  数据库

当前栏目

HTAP数据库 PostgreSQL 场景与性能测试之 26 - (OLTP) NOT IN、NOT EXISTS 查询

数据库postgresql测试性能 查询 in not 场景
2023-09-27 14:28:17 时间
背景

PostgreSQL是一个历史悠久的数据库,历史可以追溯到1973年,最早由2014计算机图灵奖得主,关系数据库的鼻祖Michael_Stonebraker 操刀设计,PostgreSQL具备与Oracle类似的功能、性能、架构以及稳定性。

pic

PostgreSQL社区的贡献者众多,来自全球各个行业,历经数年,PostgreSQL 每年发布一个大版本,以持久的生命力和稳定性著称。

2017年10月,PostgreSQL 推出10 版本,携带诸多惊天特性,目标是胜任OLAP和OLTP的HTAP混合场景的需求:

《最受开发者欢迎的HTAP数据库PostgreSQL 10特性》

1、多核并行增强

2、fdw 聚合下推

3、逻辑订阅

4、分区

5、金融级多副本

6、json、jsonb全文检索

7、还有插件化形式存在的特性,如 向量计算、JIT、SQL图计算、SQL流计算、分布式并行计算、时序处理、基因测序、化学分析、图像分析 等。

pic

在各种应用场景中都可以看到PostgreSQL的应用:

pic

PostgreSQL近年来的发展非常迅猛,从知名数据库评测网站dbranking的数据库评分趋势,可以看到PostgreSQL向上发展的趋势:

pic

从每年PostgreSQL中国召开的社区会议,也能看到同样的趋势,参与的公司越来越多,分享的公司越来越多,分享的主题越来越丰富,横跨了 传统企业、互联网、医疗、金融、国企、物流、电商、社交、车联网、共享XX、云、游戏、公共交通、航空、铁路、军工、培训、咨询服务等 行业。

接下来的一系列文章,将给大家介绍PostgreSQL的各种应用场景以及对应的性能指标。

环境

环境部署方法参考:

《PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw外部表) on ECS 部署指南(适合新用户)》

阿里云 ECS:56核,224G,1.5TB*2 SSD云盘。

操作系统:CentOS 7.4 x64

数据库版本:PostgreSQL 10

PS:ECS的CPU和IO性能相比物理机会打一定的折扣,可以按下降1倍性能来估算。跑物理主机可以按这里测试的性能乘以2来估算。

场景 - NOT IN、NOT EXISTS 查询 (OLTP) 1、背景

not in 查询,多用在排除多个输入值场景。

实际上PostgreSQL支持很多种排除多个输入值的语法。

1、not in (...)

2、not in (table or subquery or srf)

3、 all (array)

4、not exists (select 1 from (values (),(),...) as t(id) where x.?=t.id)

5、 ? and ? and ? and .....

6、left join others b on (a.?=b.?) where b.* is null

7、select ? from a except select ? from b,适用于输出字段与条件字段相同的情形。

他们的执行计划分别如下,(1亿记录,排除多个输入值。):

表越大、或Filter的值越多,使用 left join, not exist, except 的效果越好。

postgres=# explain select * from a where id not in (1,2,3,4,5); 

 QUERY PLAN 

-------------------------------------------------------------- 

 Seq Scan on a (cost=0.00..255958.10 rows=10000001 width=45) 

 Filter: (id ALL ({1,2,3,4,5}::integer[])) 

(2 rows) 

postgres=# explain select * from a where id all (array[1,2,3,4,5]); 

 QUERY PLAN 

-------------------------------------------------------------- 

 Seq Scan on a (cost=0.00..255958.10 rows=10000001 width=45) 

 Filter: (id ALL ({1,2,3,4,5}::integer[])) 

(2 rows) 

postgres=# explain select * from a where id all (array(select generate_series(1,10))); 

 QUERY PLAN 

------------------------------------------------------------- 

 Seq Scan on a (cost=5.02..318463.15 rows=9999996 width=45) 

 Filter: (id ALL ($0)) 

 InitPlan 1 (returns $0) 

 - ProjectSet (cost=0.00..5.02 rows=1000 width=4) 

 - Result (cost=0.00..0.01 rows=1 width=0) 

(5 rows) 

postgres=# explain select * from a where id all (array(select id from (values (1),(2),(3),(4),(5)) t (id))); 

 QUERY PLAN 

--------------------------------------------------------------------- 

 Seq Scan on a (cost=0.06..318458.20 rows=9999996 width=45) 

 Filter: (id ALL ($0)) 

 InitPlan 1 (returns $0) 

 - Values Scan on "*VALUES*" (cost=0.00..0.06 rows=5 width=4) 

(4 rows) 

postgres=# explain select * from a where id not in (select id from (values (1),(2),(3),(4),(5)) t (id)); 

 QUERY PLAN 

--------------------------------------------------------------------- 

 Seq Scan on a (cost=0.07..218458.15 rows=5000003 width=45) 

 Filter: (NOT (hashed SubPlan 1)) 

 SubPlan 1 

 - Values Scan on "*VALUES*" (cost=0.00..0.06 rows=5 width=4) 

(4 rows) 

postgres=# explain select * from a where not exists (select 1 from (values (1),(2),(3),(4),(5)) t (id) where t.id=a.id); 

 QUERY PLAN 

----------------------------------------------------------------------------------- 

 Merge Anti Join (cost=0.56..301364.14 rows=10000001 width=45) 

 Merge Cond: (a.id = "*VALUES*".column1) 

 - Index Scan using a_pkey on a (cost=0.43..276363.92 rows=10000006 width=45) 

 - Sort (cost=0.12..0.13 rows=5 width=4) 

 Sort Key: "*VALUES*".column1 

 - Values Scan on "*VALUES*" (cost=0.00..0.06 rows=5 width=4) 

(6 rows) 

postgres=# explain select * from a where id 1 and id 2 and id 3 and id 4 and id 5; 

 QUERY PLAN 

------------------------------------------------------------------------------- 

 Seq Scan on a (cost=0.00..318458.14 rows=10000001 width=45) 

 Filter: ((id 1) AND (id 2) AND (id 3) AND (id 4) AND (id 5)) 

(2 rows) 

postgres=# explain with t1 as (select id from (values (1),(2),(3),(4),(5)) as t(id)) 

 select a.* from a left join t1 b on (a.id=b.id) where b.* is null; 

 QUERY PLAN 

--------------------------------------------------------------------- 

 Hash Left Join (cost=0.23..230958.36 rows=50000 width=45) 

 Hash Cond: (a.id = b.id) 

 Filter: (b.* IS NULL) 

 CTE t1 

 - Values Scan on "*VALUES*" (cost=0.00..0.06 rows=5 width=4) 

 - Seq Scan on a (cost=0.00..193458.06 rows=10000006 width=45) 

 - Hash (cost=0.10..0.10 rows=5 width=32) 

 - CTE Scan on t1 b (cost=0.00..0.10 rows=5 width=32) 

(8 rows) 

postgres=# explain select id from a except select id from (values (1),(2),(3),(4),(5)) as t(id); 

 QUERY PLAN 

----------------------------------------------------------------------------------------------- 

 SetOp Except (cost=1538166.72..1588166.78 rows=10000006 width=8) 

 - Sort (cost=1538166.72..1563166.75 rows=10000011 width=8) 

 Sort Key: "*SELECT* 1".id 

 - Append (cost=0.00..293458.23 rows=10000011 width=8) 

 - Subquery Scan on "*SELECT* 1" (cost=0.00..293458.12 rows=10000006 width=8) 

 - Seq Scan on a (cost=0.00..193458.06 rows=10000006 width=4) 

 - Subquery Scan on "*SELECT* 2" (cost=0.00..0.11 rows=5 width=8) 

 - Values Scan on "*VALUES*" (cost=0.00..0.06 rows=5 width=4) 

(8 rows) 

2、设计

1亿记录,查询匹配多个输入值的性能。分别输入1,10,100,1000,10000,100000,1000000个值作为匹配条件。

1、not in (...)

2、not in (table or subquery or srf)

3、 all (array)

4、not exists (select 1 from (values (),(),...) as t(id) where x.?=t.id)

5、 ? and ? and ? and .....

6、left join others b on (a.?=b.?) where b.* is null

7、select ? from a except select ? from b,适用于输出字段与条件字段相同的情形。

3、准备测试表
create table t_in_test (id int primary key, info text, crt_time timestamp); 

4、准备测试函数(可选) 5、准备测试数据
insert into t_in_test select generate_series(1,100000000), md5(random()::text), clock_timestamp(); 

6、准备测试脚本
set parallel_setup_cost =0; 

set parallel_tuple_cost =0; 

set max_parallel_workers_per_gather =28; 

alter table t_in_test set (parallel_workers =28); 

1、not in (...)

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

do language plpgsql $$ 

declare 

 arr text; 

 ts timestamp := clock_timestamp(); 

 mx int8; 

begin 

 for i in 0..6 loop 

 mx := (1*(10^i))::int8; 

 select string_agg((random()*100000)::int::text, ,) into arr from generate_series(1, mx); 

 ts := clock_timestamp(); 

 execute select * from t_in_test where id not in (||arr||); 

 raise notice %: %, mx, clock_timestamp()-ts; 

 end loop; 

end; 

$$ ; 

2、not in (table or subquery or srf)

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

do language plpgsql $$ 

declare 

 arr text; 

 ts timestamp := clock_timestamp(); 

 mx int8; 

begin 

 for i in 0..6 loop 

 mx := (1*(10^i))::int8; 

 ts := clock_timestamp(); 

 perform * from t_in_test where not id in ( select (random()*100000)::int from generate_series(1, mx) ); 

 raise notice %: %, mx, clock_timestamp()-ts; 

 end loop; 

end; 

$$ ; 

3、 all (array)

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

do language plpgsql $$ 

declare 

 arr int[]; 

 ts timestamp := clock_timestamp(); 

 mx int8; 

begin 

 for i in 0..6 loop 

 mx := (1*(10^i))::int8; 

 select array_agg((random()*100000)::int) into arr from generate_series(1, mx); 

 ts := clock_timestamp(); 

 perform * from t_in_test where id all ( arr ); 

 raise notice %: %, mx, clock_timestamp()-ts; 

 end loop; 

end; 

$$ ; 

4、not exists (select 1 from (values (),(),...) as t(id) where x.?=t.id)

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

do language plpgsql $$ 

declare 

 ts timestamp := clock_timestamp(); 

 mx int8; 

begin 

 for i in 0..6 loop 

 mx := (1*(10^i))::int8; 

 ts := clock_timestamp(); 

 perform * from t_in_test where not exists ( select 1 from ( select (random()*100000)::int id from generate_series(1,mx) ) t where t_in_test.id=t.id ); 

 raise notice %: %, mx, clock_timestamp()-ts; 

 end loop; 

end; 

$$ ; 

6、left join others b on (a.?=b.?) where b.* is null

do language plpgsql $$ 

declare 

 ts timestamp := clock_timestamp(); 

 mx int8; 

begin 

 for i in 0..6 loop 

 mx := (1*(10^i))::int8; 

 ts := clock_timestamp(); 

 perform a.* from t_in_test a left join (select (random()*100000)::int id from generate_series(1,mx)) b on (a.id=b.id) where b.* is null; 

 raise notice %: %, mx, clock_timestamp()-ts; 

 end loop; 

end; 

$$ ; 

7、select ? from a except select ? from b,适用于输出字段与条件字段相同的情形。

do language plpgsql $$ 

declare 

 ts timestamp := clock_timestamp(); 

 mx int8; 

begin 

 for i in 0..6 loop 

 mx := (1*(10^i))::int8; 

 ts := clock_timestamp(); 

 perform a.id from t_in_test a except select (random()*100000)::int id from generate_series(1,mx); 

 raise notice %: %, mx, clock_timestamp()-ts; 

 end loop; 

end; 

$$ ; 

7、测试

1、not in (...)

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

NOTICE: 1: 00:00:20.760034 

NOTICE: 10: 00:00:27.766224 

NOTICE: 100: 00:01:22.95002 

NOTICE: 1000: 00:10:16.690793 

.......... 

10000开始很久也没跑出来。继续看后面其他方法的测试。 

2、not in (table or subquery or srf)

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

----- 

3、 all (array)

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

----- 

4、not exists (select 1 from (values (),(),...) as t(id) where x.?=t.id)

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

NOTICE: 1: 00:00:35.253582 

NOTICE: 10: 00:00:35.256638 

NOTICE: 100: 00:00:35.164034 

NOTICE: 1000: 00:00:35.417756 

NOTICE: 10000: 00:00:35.205454 

NOTICE: 100000: 00:00:35.458987 

NOTICE: 1000000: 00:00:35.447743 

6、a left join others b on (a.?=b.?) where b.* is null

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

NOTICE: 1: 00:00:36.474715 

NOTICE: 10: 00:00:36.53191 

NOTICE: 100: 00:00:36.60439 

NOTICE: 1000: 00:00:36.534846 

NOTICE: 10000: 00:00:36.574136 

NOTICE: 100000: 00:00:36.519582 

NOTICE: 1000000: 00:00:37.675594 

7、select ? from a except select ? from b,适用于输出字段与条件字段相同的情形。

1,10,100,1000,10000,100000,1000000 个输入值的测试性能

NOTICE: 1: 00:00:50.566741 

NOTICE: 10: 00:00:50.051715 

NOTICE: 100: 00:00:50.098839 

NOTICE: 1000: 00:00:49.966196 

NOTICE: 10000: 00:00:50.608288 

NOTICE: 100000: 00:00:50.715218 

NOTICE: 1000000: 00:00:51.794935 

TPS 平均响应时间
not exists为例,1亿记录1到100万个点的排他过滤。
NOTICE: 1: 00:00:35.253582
NOTICE: 10: 00:00:35.256638
NOTICE: 100: 00:00:35.164034
NOTICE: 1000: 00:00:35.417756
NOTICE: 10000: 00:00:35.205454
NOTICE: 100000: 00:00:35.458987
NOTICE: 1000000: 00:00:35.447743

HTAP数据库 PostgreSQL 场景与性能测试之 3.1 - (OLAP) 大表JOIN统计查询-10亿 join 1亿 agg PostgreSQL , HTAP , OLTP , OLAP , 场景与性能测试 PostgreSQL是一个历史悠久的数据库,历史可以追溯到1973年,最早由2014计算机图灵奖得主,关系数据库的鼻祖Michael_Stonebraker 操刀设计,PostgreSQL具备与Oracle类似的功能、性能、架构以及稳定性。 PostgreSQL社区的贡献者众多
阿里云在线数据仓库服务 AnalyticDB for PostgreSQL 典型场景解析 云数据库AnalyticDB for PostgreSQL(原HybridDB forPostgreSQL)是一种大规模并行处理(MPP)数据仓库服务,可提供海量数据在线分析服务。 AnalyticDB for PostgreSQL基于PostgreSQL内核构建,兼容ANSI SQL 2003,兼容PostgreSQL/Oracle数据库生态,支持行存储和列存储模式。
【独家直播】 德哥PG系列课程15讲—PostgreSQL 多场景 沙箱实验从入门到精通 为了能够让用户可以更快的上手PostgreSQL,德哥与云栖团队的小伙伴制作了一系列阿里云RDS PostgreSQL沙箱实验。所有实验内容都结合了应用场景,非常具有代表性,兴趣性与一定的挑战性。
HTAP数据库 PostgreSQL 场景与性能测试之 47 - (OLTP) 空间应用 - 高并发空间位置更新、多属性KNN搜索并测(含空间索引)末端配送类项目 PostgreSQL , HTAP , OLTP , OLAP , 场景与性能测试 PostgreSQL是一个历史悠久的数据库,历史可以追溯到1973年,最早由2014计算机图灵奖得主,关系数据库的鼻祖Michael_Stonebraker 操刀设计,PostgreSQL具备与Oracle类似的功能、性能、架构以及稳定性。