PostgreSQL 老湿机图解平安科技遇到的垃圾回收"坑"
近日收到 平安科技 海安童鞋 那里反馈的一个问题,在生产环境使用PostgreSQL的过程中,遇到的一个有点"不可思议"的问题。
一张经常被更新的表,通过主键查询这张表的记录时,发现需要扫描异常多的数据块。
本文将为你详细剖析这个问题,同时给出规避的方法,以及内核改造的方法。
文中还涉及到索引的结构解说,仔细阅读定有收获。
.1. 和长事务有关,我在很多文章都提到过,PG在垃圾回收时,只判断垃圾版本是否是当前数据库中最老的事务之前的,如果是之后产生的,则不回收。
所以当数据库存在长事务时,同时被访问的记录被多次变更,造成一些垃圾版本没有回收。
.2. PG的索引没有版本信息,所以必须要访问heap tuple获取版本。
测试表
postgres=# create unlogged table test03 (id int primary key, info text);
频繁更新100条记录
$ vi test.sql \setrandom id 1 100 insert into test03 values(:id, repeat(md5(random()::text), 1000)) on conflict on constraint test03_pkey do update set info=excluded.info; pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 10000000
开启长事务,啥也不干
postgres=# begin; BEGIN postgres=# select txid_current(); txid_current -------------- 3474642778 (1 row)
经过一段时间的更新,发现需要访问很多数据块了。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Index Scan using test03_pkey on public.test03 (cost=0.42..8.44 rows=1 width=417) (actual time=0.661..4.440 rows=1 loops=1) Output: id, info Index Cond: (test03.id = 2) Buffers: shared hit=1753 Planning time: 0.104 ms Execution time: 4.468 ms (6 rows)
观察访问很多的块是heap块
postgres=# set enable_indexscan=off; postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test03 (cost=4.43..8.44 rows=1 width=416) (actual time=5.818..5.819 rows=1 loops=1) Output: id, info Recheck Cond: (test03.id = 2) Heap Blocks: exact=1986 Buffers: shared hit=1996 - Bitmap Index Scan on test03_pkey (cost=0.00..4.43 rows=1 width=0) (actual time=0.418..0.418 rows=1986 loops=1) Index Cond: (test03.id = 2) Buffers: shared hit=10 Planning time: 0.200 ms Execution time: 5.851 ms (10 rows)
提交长事务前,使用vacuum verbose可以看到无法回收这些持续产生的垃圾page(包括index和heap的page)。
提交长事务
postgres=# end; COMMIT
等待autovacuum进程回收垃圾,delete half index page。
访问的数据块数量下降了。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2; QUERY PLAN -------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test03 (cost=4.43..8.45 rows=1 width=417) (actual time=0.113..0.118 rows=1 loops=1) Output: id, info Recheck Cond: (test03.id = 2) Heap Blocks: exact=3 Buffers: shared hit=14 - Bitmap Index Scan on test03_pkey (cost=0.00..4.43 rows=1 width=0) (actual time=0.067..0.067 rows=3 loops=1) Index Cond: (test03.id = 2) Buffers: shared hit=11 Planning time: 0.101 ms Execution time: 0.148 ms (10 rows)
使用pageinspect观察测试过程中索引页的内容变化
创建extension
postgres=# create extension pageinspect;
开启长事务
postgres=# begin; BEGIN postgres=# select txid_current();
测试60秒更新
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 60
观察需要扫描多少数据块
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Index Scan using test03_pkey on public.test03 (cost=0.43..8.45 rows=1 width=417) (actual time=0.052..15.738 rows=1 loops=1) Output: id, info Index Cond: (test03.id = 1) Buffers: shared hit=2663 Planning time: 0.572 ms Execution time: 15.790 ms (6 rows) postgres=# set enable_indexscan=off; postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test03 (cost=4.44..8.45 rows=1 width=417) (actual time=6.138..6.139 rows=1 loops=1) Output: id, info Recheck Cond: (test03.id = 1) Heap Blocks: exact=2651 Buffers: shared hit=2663 - Bitmap Index Scan on test03_pkey (cost=0.00..4.44 rows=1 width=0) (actual time=0.585..0.585 rows=2651 loops=1) Index Cond: (test03.id = 1) Buffers: shared hit=12 Planning time: 0.093 ms Execution time: 6.218 ms (10 rows)
观察索引页, root=412, 层级=2
postgres=# select * from bt_metap(test03_pkey); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 412 | 2 | 412 | 2 (1 row)
查看root页内容
postgres=# select * from bt_page_stats(test03_pkey,412); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 412 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 2 | 2 (1 row) postgres=# select * from bt_page_items(test03_pkey,412); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (584,1) | 16 | f | f | 21 00 00 00 00 00 00 00 3 | (411,1) | 16 | f | f | 46 00 00 00 00 00 00 00 (3 rows)
查看最左branch 页内容
postgres=# select * from bt_page_items(test03_pkey,3); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (58,1) | 16 | f | f | 21 00 00 00 00 00 00 00 2 | (1,1) | 8 | f | f | 3 | (937,1) | 16 | f | f | 01 00 00 00 00 00 00 00 4 | (767,1) | 16 | f | f | 01 00 00 00 00 00 00 00 5 | (666,1) | 16 | f | f | 01 00 00 00 00 00 00 00 6 | (572,1) | 16 | f | f | 01 00 00 00 00 00 00 00 7 | (478,1) | 16 | f | f | 01 00 00 00 00 00 00 00 8 | (395,1) | 16 | f | f | 01 00 00 00 00 00 00 00 9 | (307,1) | 16 | f | f | 01 00 00 00 00 00 00 00 10 | (173,1) | 16 | f | f | 01 00 00 00 00 00 00 00 11 | (99,1) | 16 | f | f | 01 00 00 00 00 00 00 00 12 | (951,1) | 16 | f | f | 02 00 00 00 00 00 00 00 13 | (867,1) | 16 | f | f | 02 00 00 00 00 00 00 00 14 | (773,1) | 16 | f | f | 02 00 00 00 00 00 00 00 15 | (660,1) | 16 | f | f | 02 00 00 00 00 00 00 00 16 | (564,1) | 16 | f | f | 02 00 00 00 00 00 00 00 17 | (496,1) | 16 | f | f | 02 00 00 00 00 00 00 00 18 | (413,1) | 16 | f | f | 02 00 00 00 00 00 00 00 19 | (319,1) | 16 | f | f | 02 00 00 00 00 00 00 00 20 | (204,1) | 16 | f | f | 02 00 00 00 00 00 00 00 21 | (151,1) | 16 | f | f | 02 00 00 00 00 00 00 00 22 | (64,1) | 16 | f | f | 02 00 00 00 00 00 00 00 23 | (865,1) | 16 | f | f | 03 00 00 00 00 00 00 00 24 | (777,1) | 16 | f | f | 03 00 00 00 00 00 00 00
查看包含最小值的最左叶子节点内容
postgres=# select * from bt_page_items(test03_pkey,1); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (57342,14) | 16 | f | f | 01 00 00 00 00 00 00 00 2 | (71195,14) | 16 | f | f | 01 00 00 00 00 00 00 00 3 | (71171,12) | 16 | f | f | 01 00 00 00 00 00 00 00 4 | (71185,1) | 16 | f | f | 01 00 00 00 00 00 00 00 5 | (71150,17) | 16 | f | f | 01 00 00 00 00 00 00 00 6 | (71143,1) | 16 | f | f | 01 00 00 00 00 00 00 00 ......
查看包含最小值的最右叶子节点内容
postgres=# select * from bt_page_items(test03_pkey,99); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (66214,10) | 16 | f | f | 02 00 00 00 00 00 00 00 2 | (12047,15) | 16 | f | f | 01 00 00 00 00 00 00 00 ...... 40 | (11052,15) | 16 | f | f | 01 00 00 00 00 00 00 00 41 | (11009,6) | 16 | f | f | 01 00 00 00 00 00 00 00 42 | (11021,6) | 16 | f | f | 01 00 00 00 00 00 00 00 43 | (71209,3) | 16 | f | f | 02 00 00 00 00 00 00 00 44 | (69951,1) | 16 | f | f | 02 00 00 00 00 00 00 00
查看这些叶子索引页包含data=01 00 00 00 00 00 00 00的item有多少条,可以对应到需要扫描多少heap page
select count(distinct substring(ctid::text, 1, "position"(ctid::text, ,))) from ( select * from bt_page_items(test03_pkey,1) union all select * from bt_page_items(test03_pkey,937) union all select * from bt_page_items(test03_pkey,767) union all select * from bt_page_items(test03_pkey,666) union all select * from bt_page_items(test03_pkey,572) union all select * from bt_page_items(test03_pkey,478) union all select * from bt_page_items(test03_pkey,395) union all select * from bt_page_items(test03_pkey,307) union all select * from bt_page_items(test03_pkey,173) union all select * from bt_page_items(test03_pkey,99) union all select * from bt_page_items(test03_pkey,951) where data=01 00 00 00 00 00 00 00; count ------- 2652 (1 row)
2652与前面执行计划中看到的2651对应。
提交长事务
postgres=# end; COMMIT
等待autovacuum结束
postgres=# select * from pg_stat_all_tables where relname=test03; -[ RECORD 1 ]-------+------------------------------ relid | 14156713 schemaname | public relname | test03 seq_scan | 39 seq_tup_read | 5137822 idx_scan | 3522865664 idx_tup_fetch | 3521843178 n_tup_ins | 1022487 n_tup_upd | 3476465702 n_tup_del | 22387 n_tup_hot_upd | 3433472972 n_live_tup | 100 n_dead_tup | 0 n_mod_since_analyze | 0 last_vacuum | 2016-07-15 00:03:53.909086+08 last_autovacuum | 2016-07-15 00:32:04.177672+08 last_analyze | 2016-07-15 00:03:53.909825+08 last_autoanalyze | 2016-07-15 00:07:23.541629+08 vacuum_count | 10 autovacuum_count | 125 analyze_count | 7 autoanalyze_count | 99
观察现在需要扫描多少块
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test03 (cost=40.40..44.41 rows=1 width=417) (actual time=0.026..0.027 rows=1 loops=1) Output: id, info Recheck Cond: (test03.id = 1) Heap Blocks: exact=1 Buffers: shared hit=5 - Bitmap Index Scan on test03_pkey (cost=0.00..40.40 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=1) Index Cond: (test03.id = 1) Buffers: shared hit=4 Planning time: 0.137 ms Execution time: 0.052 ms (10 rows)
查看现在的索引页内容,half page已经remove掉了
postgres=# select count(distinct substring(ctid::text, 1, "position"(ctid::text, ,))) from ( select * from bt_page_items(test03_pkey,1) union all select * from bt_page_items(test03_pkey,937) union all select * from bt_page_items(test03_pkey,767) union all select * from bt_page_items(test03_pkey,666) union all select * from bt_page_items(test03_pkey,572) union all select * from bt_page_items(test03_pkey,478) union all select * from bt_page_items(test03_pkey,395) union all select * from bt_page_items(test03_pkey,307) union all select * from bt_page_items(test03_pkey,173) union all select * from bt_page_items(test03_pkey,99) union all select * from bt_page_items(test03_pkey,951) where data=01 00 00 00 00 00 00 00 ; NOTICE: page is deleted NOTICE: page is deleted NOTICE: page is deleted NOTICE: page is deleted NOTICE: page is deleted NOTICE: page is deleted NOTICE: page is deleted NOTICE: page is deleted NOTICE: page is deleted -[ RECORD 1 ] count | 2
再观察索引页内容,已经被autovacuum收缩了
postgres=# select * from bt_metap(test03_pkey); magic | version | root | level | fastroot | fastlevel --------+---------+------+-------+----------+----------- 340322 | 2 | 412 | 2 | 412 | 2 (1 row) postgres=# select * from bt_page_items(test03_pkey,412); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (3,1) | 8 | f | f | 2 | (584,1) | 16 | f | f | 21 00 00 00 00 00 00 00 3 | (411,1) | 16 | f | f | 46 00 00 00 00 00 00 00 (3 rows) postgres=# select * from bt_page_items(test03_pkey,3); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (58,1) | 16 | f | f | 21 00 00 00 00 00 00 00 2 | (1,1) | 8 | f | f | 3 | (99,1) | 16 | f | f | 01 00 00 00 00 00 00 00 4 | (865,1) | 16 | f | f | 02 00 00 00 00 00 00 00 5 | (844,1) | 16 | f | f | 03 00 00 00 00 00 00 00 6 | (849,1) | 16 | f | f | 04 00 00 00 00 00 00 00 7 | (18,1) | 16 | f | f | 05 00 00 00 00 00 00 00 8 | (95,1) | 16 | f | f | 06 00 00 00 00 00 00 00 9 | (63,1) | 16 | f | f | 07 00 00 00 00 00 00 00 10 | (34,1) | 16 | f | f | 08 00 00 00 00 00 00 00 11 | (851,1) | 16 | f | f | 09 00 00 00 00 00 00 00 12 | (10,1) | 16 | f | f | 0a 00 00 00 00 00 00 00 13 | (71,1) | 16 | f | f | 0b 00 00 00 00 00 00 00 14 | (774,1) | 16 | f | f | 0c 00 00 00 00 00 00 00 15 | (213,1) | 16 | f | f | 0d 00 00 00 00 00 00 00 16 | (881,1) | 16 | f | f | 0e 00 00 00 00 00 00 00 17 | (837,1) | 16 | f | f | 0f 00 00 00 00 00 00 00 18 | (100,1) | 16 | f | f | 10 00 00 00 00 00 00 00 19 | (872,1) | 16 | f | f | 11 00 00 00 00 00 00 00 20 | (32,1) | 16 | f | f | 12 00 00 00 00 00 00 00 21 | (65,1) | 16 | f | f | 13 00 00 00 00 00 00 00 22 | (870,1) | 16 | f | f | 14 00 00 00 00 00 00 00 23 | (841,1) | 16 | f | f | 15 00 00 00 00 00 00 00 24 | (850,1) | 16 | f | f | 16 00 00 00 00 00 00 00 25 | (30,1) | 16 | f | f | 17 00 00 00 00 00 00 00 26 | (91,1) | 16 | f | f | 18 00 00 00 00 00 00 00 27 | (829,1) | 16 | f | f | 19 00 00 00 00 00 00 00 28 | (16,1) | 16 | f | f | 1a 00 00 00 00 00 00 00 29 | (784,1) | 16 | f | f | 1b 00 00 00 00 00 00 00 30 | (31,1) | 16 | f | f | 1c 00 00 00 00 00 00 00 31 | (88,1) | 16 | f | f | 1d 00 00 00 00 00 00 00 32 | (48,1) | 16 | f | f | 1e 00 00 00 00 00 00 00 33 | (822,1) | 16 | f | f | 1f 00 00 00 00 00 00 00 34 | (817,1) | 16 | f | f | 20 00 00 00 00 00 00 00 35 | (109,1) | 16 | f | f | 21 00 00 00 00 00 00 00 (35 rows) postgres=# select * from bt_page_items(test03_pkey,1); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (57342,14) | 16 | f | f | 01 00 00 00 00 00 00 00 2 | (71195,14) | 16 | f | f | 01 00 00 00 00 00 00 00 (2 rows) postgres=# select * from bt_page_items(test03_pkey,99); itemoffset | ctid | itemlen | nulls | vars | data ------------+------------+---------+-------+------+------------------------- 1 | (66214,10) | 16 | f | f | 02 00 00 00 00 00 00 00 2 | (71209,3) | 16 | f | f | 02 00 00 00 00 00 00 00 (2 rows)
src/backend/access/nbtree/nbtpage.c
/* * Unlink a page in a branch of half-dead pages from its siblings. * If the leaf page still has a downlink pointing to it, unlinks the highest * parent in the to-be-deleted branch instead of the leaf page. To get rid * of the whole branch, including the leaf page itself, iterate until the * leaf page is deleted. * Returns false if the page could not be unlinked (shouldnt happen). * If the (new) right sibling of the page is empty, *rightsib_empty is set * to true. static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) * Mark the page itself deleted. It can be recycled when all current * transactions are gone. Storing GetTopTransactionId() would work, but * were in VACUUM and would not otherwise have an XID. Having already * updated links to the target, ReadNewTransactionId() suffices as an * upper bound. Any scan having retained a now-stale link is advertising * in its PGXACT an xmin less than or equal to the value we read here. It * will continue to do so, holding back RecentGlobalXmin, for the duration * of that scan. page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque- btpo_flags = ~BTP_HALF_DEAD; opaque- btpo_flags |= BTP_DELETED; opaque- btpo.xact = ReadNewTransactionId(); ...
contrib/pageinspect/btreefuncs.c
if (P_ISDELETED(opaque)) elog(NOTICE, "page is deleted");
1. b-tree原理
https://yq.aliyun.com/articles/54437
1. 频繁更新的表,数据库的优化手段
1.1 监控长事务,绝对控制长事务
1.2 缩小autovacuum naptime (to 1s) ,
增加autovacuum work (to 10), 设置autovacuum delay=0, 增大autovacuum work memory (to 512MB or bigger), 将经常变更的表和索引放到好的iops的设备上 。 不要小看这几个参数,非常的关键。
1.3 如果事务释放并且表上面已经出发了vacuum后,还是要查很多的PAGE,说明index page没有delete和收缩,可能是index page没有达到compact的要求,如果遇到这种情况,需要reindex。
2. PostgreSQL 9.6通过快照过旧彻底解决这个长事务引发的坑爹问题。
9.6 vacuum的改进如图
如何判断snapshot too old如图
3. 9.6的垃圾回收机制也还有改进的空间,做到更细粒度的版本控制,改进方法以前分享过,在事务列表中增加记录事务隔离级别,通过隔离级别判断需要保留的版本,而不是简单的通过最老事务来判断需要保留的垃圾版本。
祝大家玩得开心,欢迎随时来 阿里云促膝长谈 业务需求 ,恭候光临。
阿里云的小伙伴们加油,努力做 最贴地气的云数据库 。
PostgreSQL 垃圾回收参数优化之 - maintenance_work_mem , autovacuum_work_mem PostgreSQL , 垃圾回收 , 索引扫描 , 内存 夜谈PostgreSQL 垃圾回收参数优化之 - maintenance_work_mem , autovacuum_work_mem。 http://www.postgres.cn/v2/news/viewone/1/398 https://rhaas.blogspot.com/2019/01/how-much
PostgreSQL物理 备库 的哪些操作或配置,可能影响 主库 的性能、垃圾回收、IO波动 PostgreSQL , 物理复制 , 垃圾回收 , vacuum_defer_cleanup_age , hot_standby_feedback , max_standby_archive_delay , max_standby_streaming_delay PostgreSQL 物理备库的哪些配置,或者哪些操作,可能影响到主库呢? 首先,简单介绍一下PostgreSQL的物理备库,物理备库就是基于PostgreSQL WAL流式复制,实时恢复的备库。
阿里云数据库专家于巍荣获PostgreSQL中国技术大会“最具价值专家 MVP”奖 2023年3月3日,在由PostgreSQL中文社区主办的“第十二届PostgreSQL中国技术大会”上,阿里云数据库开源首席架构师于巍荣获“中国 PostgreSQL 最具价值专家 MVP”奖项。
喜讯!阿里云数据库PolarDB荣获第12届PostgreSQL中国技术大会“开源数据库杰出贡献奖” 2023年3月3日,由PostgreSQL中文社区主办的“第十二届PostgreSQL中国技术大会”在杭州隆重开幕。本次大会以“突破·进化·共赢 —— 安全可靠,共建与机遇”为主题,邀请了来自国内外的行业专家、企业家代表等,共见创新成果、共论国产数据库发展机遇。会上,PolarDB for PostgreSQL(简称PolarDB-PG )凭借在数据库开源领域的深耕布局和产品技术实力,荣获“开源数据库杰出贡献奖”。
《RDS数据库入门一本通》电子版地址 本书体系完整,理论充实,内容由浅入深,循序渐进,更配有详细的实例解说,为初学者提供一个完整、严密的思维框架,零基础的你也可轻松掌握RDS数据库的美妙节奏。
相关文章
- 《图解算法》系列学习(三)
- 超强图解 Pandas 18 招!
- Canvas 从入门到劝朋友放弃(图解版)
- 归并排序算法详细图解_归并排序算法描述
- 《图解HTTP》读书总结-第一章-了解Web及网络基础
- 决策树原理及使用_虹吸原理图解
- 【数据结构与算法】:带你熟悉归并排序(手绘图解+leetCode原题)
- 图解K8s源码 - 序章 - K8s组件架构
- Windows下PostgreSQL安装图解
- Postgresql复制技术:一次解决永久问题(postgresql复制)
- 优化实现PostgreSQL缓存优化,更高性能!(postgresql缓存)
- Postgresql:开启你的数据之旅(进入postgresql)
- 体验PostgreSQL之精彩:驱动新体验(postgresql驱动)
- ?探究PostgreSQL:一款强大的数据库系统(postgresql是什么)
- 『Postgresql实现分区表:优化查询效率』(postgresql分区表)
- 引擎使用PostgreSQL实现规则引擎功能(postgresql规则)
- PostgreSQL:优势与不足(postgresql优缺点)
- 轻松搞定:Linux安装DB2的简明图解(linux安装db2图解)
- 优势PostgreSQL归档:优势及其应用(postgresql归档)
- 图解java并发(上)
- PostgreSQL 接口:连接数据库的必备工具(postgresql接口)
- 初探PostgreSQL:走上NoSQL学习之路(postgresql学习)
- 如何在PostgreSQL中创建新用户(postgresql创建用户)
- 25 Tips for Optimizing Your PostgreSQL Queries(postgresql查询优化)
- PostgreSQL实例实战:了解数据库管理系统(postgresql实例)
- Redis实现的高位进位加法图解(redis高位进位加法图)
- Redis中雪崩效应完美图解(redis雪崩图解)
- Redis运行机制图解有助于提升运行效率(redis运行图解)
- 利用iframe实现ajax跨域通信的实现原理(图解)