zl程序教程

您现在的位置是:首页 >  其他

当前栏目

PostgreSQL DISTINCT 和 DISTINCT ON 语法的使用

Onpostgresql 语法 Distinct 使用
2023-09-27 14:27:57 时间
背景

本文介绍一下distinct的几种用法。

https://www.postgresql.org/docs/10/static/queries-select-lists.html#queries-distinct

1、返回唯一值

After the select list has been processed, the result table can optionally be subject to the elimination of duplicate rows. 

The DISTINCT key word is written directly after SELECT to specify this: 

SELECT DISTINCT select_list ... 

(Instead of DISTINCT the key word ALL can be used to specify the default behavior of retaining all rows.) 

Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison. 

2、返回指定列唯一的任意行。 也可以使用窗口来实现类似功能, 但是性能没有distinct on好,因为它是任意的。

Alternatively, an arbitrary expression can determine what rows are to be considered distinct: 

SELECT DISTINCT ON (expression [, expression ...]) select_list ... 

Here expression is an arbitrary value expression that is evaluated for all rows. 

A set of rows for which all the expressions are equal are considered duplicates, 

and only the first row of the set is kept in the output. 

Note that the “first row” of a set is unpredictable unless the query is sorted 

on enough columns to guarantee a unique ordering of the rows arriving at the DISTINCT filter. 

(DISTINCT ON processing occurs after ORDER BY sorting.) 

The DISTINCT ON clause is not part of the SQL standard and is sometimes considered bad 

style because of the potentially indeterminate nature of its results. 

With judicious use of GROUP BY and subqueries in FROM, 

this construct can be avoided, but it is often the most convenient alternative. 

3、返回唯一值个数

select count(distinct (表达式1, ....)) from tbl; 

select count(*) from (select 表达式,.... from tbl group by 表达式,....) t; 

例子

1、返回所有记录。

select ALL id, c1 from test; 

select id, c1 from test; 

2、返回 id,c1 唯一值。(这里NULL视为相等)。

select DISTINCT id, c1 from test; 

select id, c1 from test group by id, c1; 

3、返回c3唯一的任意行。

select distinct on (c3) c2,c3 from tbl; 

postgres=# explain (analyze,verbose,timing,costs,buffers) select distinct on (c3) c2,c3 from tbl; 

 QUERY PLAN 

---------------------------------------------------------------------------------------------------------------------------------- 

 Unique (cost=115063.84..120063.84 rows=11 width=8) (actual time=1865.317..2279.840 rows=11 loops=1) 

 Output: c2, c3 

 Buffers: shared hit=5406 

 - Sort (cost=115063.84..117563.84 rows=1000000 width=8) (actual time=1865.312..2068.536 rows=1000000 loops=1) 

 Output: c2, c3 

 Sort Key: tbl.c3 

 Sort Method: quicksort Memory: 71452kB 

 Buffers: shared hit=5406 

 - Seq Scan on public.tbl (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.017..264.041 rows=1000000 loops=1) 

 Output: c2, c3 

 Buffers: shared hit=5406 

 Planning time: 0.070 ms 

 Execution time: 2291.536 ms 

(13 rows) 

postgres=# select distinct on (c3) c2,c3 from tbl; -- c3 唯一, 但是可能返回任意行 

 c2 | c3 

-----+---- 

 100 | 0 

 23 | 1 

 63 | 2 

 47 | 3 

 34 | 4 

 17 | 5 

 60 | 6 

 8 | 7 

 70 | 8 

 73 | 9 

 94 | 10 

(11 rows) 

使用窗口函数可以达到类似效果,但是可以确定返回哪行,因此也更慢一些:

select * from (select row_number() over (partition by c3) as rn, * from tbl) t where rn=1; 

postgres=# select * from (select row_number() over (partition by c3) as rn, * from tbl) t where rn=1; 

 rn | c1 | c2 | c3 

----+-----+-----+---- 

 1 | 420 | 100 | 0 

 1 | 721 | 23 | 1 

 1 | 80 | 63 | 2 

 1 | 322 | 47 | 3 

 1 | 457 | 34 | 4 

 1 | 386 | 17 | 5 

 1 | 491 | 60 | 6 

 1 | 260 | 8 | 7 

 1 | 41 | 70 | 8 

 1 | 56 | 73 | 9 

 1 | 154 | 94 | 10 

(11 rows) 

postgres=# explain analyze select * from (select row_number() over (partition by c3) as rn, * from tbl) t where rn=1; 

 QUERY PLAN 

---------------------------------------------------------------------------------------------------------------------------------- 

 Subquery Scan on t (cost=115063.84..145063.84 rows=5000 width=20) (actual time=1855.132..2860.276 rows=11 loops=1) 

 Filter: (t.rn = 1) 

 Rows Removed by Filter: 999989 

 - WindowAgg (cost=115063.84..132563.84 rows=1000000 width=20) (actual time=1855.129..2739.190 rows=1000000 loops=1) 

 - Sort (cost=115063.84..117563.84 rows=1000000 width=12) (actual time=1855.115..2028.946 rows=1000000 loops=1) 

 Sort Key: tbl.c3 

 Sort Method: quicksort Memory: 71452kB 

 - Seq Scan on tbl (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.015..251.021 rows=1000000 loops=1) 

 Planning time: 0.115 ms 

 Execution time: 2871.551 ms 

(10 rows) 

4、返回有多少个唯一值

select count(distinct (表达式,....)) from tbl; 

postgres=# select count(distinct c3) from tbl; 

 count 

------- 

(1 row) 

postgres=# select count(distinct (c3,c2)) from tbl; 

 count 

------- 

 1111 

(1 row) 

postgres=# explain (analyze,verbose,timing,costs,buffers) select count(distinct (c3,c2)) from tbl;; 

 QUERY PLAN 

---------------------------------------------------------------------------------------------------------------------------- 

 Aggregate (cost=17906.00..17906.01 rows=1 width=8) (actual time=6905.660..6905.661 rows=1 loops=1) 

 Output: count(DISTINCT ROW(c3, c2)) 

 Buffers: shared hit=5406 

 - Seq Scan on public.tbl (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.017..156.436 rows=1000000 loops=1) 

 Output: c1, c2, c3 

 Buffers: shared hit=5406 

 Planning time: 0.062 ms 

 Execution time: 6905.727 ms 

(8 rows) 

或使用group by的方法

select count(*) from (select 表达式,.... from tbl group by 表达式,....) t; 

postgres=# select count(*) from (select c2,c3 from tbl group by c2,c3) t; 

 count 

------- 

 1111 

(1 row) 

postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from (select c2,c3 from tbl group by c2,c3) t; 

 QUERY PLAN 

---------------------------------------------------------------------------------------------------------------------------------- 

 Aggregate (cost=20431.00..20431.01 rows=1 width=8) (actual time=674.609..674.609 rows=1 loops=1) 

 Output: count(*) 

 Buffers: shared hit=5406 

 - HashAggregate (cost=20406.00..20417.11 rows=1111 width=8) (actual time=674.093..674.409 rows=1111 loops=1) 

 Output: tbl.c2, tbl.c3 

 Group Key: tbl.c2, tbl.c3 

 Buffers: shared hit=5406 

 - Seq Scan on public.tbl (cost=0.00..15406.00 rows=1000000 width=8) (actual time=0.014..143.904 rows=1000000 loops=1) 

 Output: tbl.c1, tbl.c2, tbl.c3 

 Buffers: shared hit=5406 

 Planning time: 0.120 ms 

 Execution time: 674.684 ms 

(12 rows) 

PostgreSQL 12 文档: SQL 语法 SQL 命令 这部分包含PostgreSQL支持的SQL命令的参考信息。每条命令的标准符合和兼容的信息可以在相关的参考页中找到。
PostgreSQL的学习心得和知识总结(二十五)|语法级自上而下完美实现MySQL数据库的 字段默认值的自动插入更新 的实现方案 本人CSDN博主 孤傲小二~阿沐,本文《PostgreSQL的学习心得和知识总结(二十五)|语法级自上而下完美实现MySQL数据库的 字段默认值的自动插入更新 的实现方案》来自于我在CSDN的同名文档
PostgreSQL 12 preview - CTE 增强,支持用户语法层控制 materialized 优化 PostgreSQL , CTE , materialized , not materialized , push down PostgreSQL with 语法,能跑非常复杂的SQL逻辑,包括递归,多语句物化计算等。 在12以前的版本中,WITH中的每一个CTE(common table express),都是直接进行物化的,也就是说外层的条件不会推到CTE(物化节点)里
2、重分布去重后的数据,(这一步,如果distinct值特别多,那么就会比较耗时) 3、然后再去重, 4、最后count (xxx), 5、求所有节点的count SUM。
PostgreSQL允许带有命名参数的函数被使用位置或命名记号法调用。命名记号法对于有大量参数的函数特别有用,因为它让参数和实际参数之间的关联更明显和可靠。
PostgreSQL 10.1 手册_部分 II. SQL 语言_第 4 章 SQL语法_4.2. 值表达式 4.2. 值表达式 4.2.1. 列引用 4.2.2. 位置参数 4.2.3. 下标 4.2.4. 域选择 4.2.5. 操作符调用 4.2.6. 函数调用 4.2.7. 聚合表达式 4.2.8. 窗口函数调用 4.2.9. 类型转换 4.2.10. 排序规则表达式 4.2.11. 标量子查询 4.2.12. 数组构造器 4.2.13. 行构造器 4.2.14. 表达式计算规则
值表达式被用于各种各样的环境中,例如在SELECT命令的目标列表中、作为INSERT或UPDATE中的新列值或者若干命令中的搜索条件。