zl程序教程

您现在的位置是:首页 >  后端

当前栏目

PostgreSQL 老湿机图解平安科技遇到的垃圾回收"坑"

图解postgresql 科技 quot 遇到 垃圾 回收 平安
2023-09-14 09:04:38 时间

近日收到 平安科技 海安童鞋 那里反馈的一个问题,在生产环境使用PostgreSQL的过程中,遇到的一个有点"不可思议"的问题。

一张经常被更新的表,通过主键查询这张表的记录时,发现需要扫描异常多的数据块。

本文将为你详细剖析这个问题,同时给出规避的方法,以及内核改造的方法。

文中还涉及到索引的结构解说,仔细阅读定有收获。

.1. 和长事务有关,我在很多文章都提到过,PG在垃圾回收时,只判断垃圾版本是否是当前数据库中最老的事务之前的,如果是之后产生的,则不回收。

所以当数据库存在长事务时,同时被访问的记录被多次变更,造成一些垃圾版本没有回收。
screenshot

.2. PG的索引没有版本信息,所以必须要访问heap tuple获取版本。
screenshot

测试表

postgres=# create unlogged table test03 (id int primary key, info text);

频繁更新100条记录

$ vi test.sql

\setrandom id 1 100

insert into test03 values(:id, repeat(md5(random()::text), 1000)) on conflict on constraint test03_pkey do update set info=excluded.info;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 10000000

开启长事务,啥也不干

postgres=# begin;

BEGIN

postgres=# select txid_current();

 txid_current 

--------------

 3474642778

(1 row)

经过一段时间的更新,发现需要访问很多数据块了。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;

 QUERY PLAN 

-----------------------------------------------------------------------------------------------------------------------------

 Index Scan using test03_pkey on public.test03 (cost=0.42..8.44 rows=1 width=417) (actual time=0.661..4.440 rows=1 loops=1)

 Output: id, info

 Index Cond: (test03.id = 2)

 Buffers: shared hit=1753

 Planning time: 0.104 ms

 Execution time: 4.468 ms

(6 rows)

观察访问很多的块是heap块

postgres=# set enable_indexscan=off;

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;

 QUERY PLAN 

-----------------------------------------------------------------------------------------------------------------------

 Bitmap Heap Scan on public.test03 (cost=4.43..8.44 rows=1 width=416) (actual time=5.818..5.819 rows=1 loops=1)

 Output: id, info

 Recheck Cond: (test03.id = 2)

 Heap Blocks: exact=1986

 Buffers: shared hit=1996

 - Bitmap Index Scan on test03_pkey (cost=0.00..4.43 rows=1 width=0) (actual time=0.418..0.418 rows=1986 loops=1)

 Index Cond: (test03.id = 2)

 Buffers: shared hit=10

 Planning time: 0.200 ms

 Execution time: 5.851 ms

(10 rows)

提交长事务前,使用vacuum verbose可以看到无法回收这些持续产生的垃圾page(包括index和heap的page)。

提交长事务

postgres=# end;

COMMIT

等待autovacuum进程回收垃圾,delete half index page。
访问的数据块数量下降了。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;

 QUERY PLAN 

--------------------------------------------------------------------------------------------------------------------

 Bitmap Heap Scan on public.test03 (cost=4.43..8.45 rows=1 width=417) (actual time=0.113..0.118 rows=1 loops=1)

 Output: id, info

 Recheck Cond: (test03.id = 2)

 Heap Blocks: exact=3

 Buffers: shared hit=14

 - Bitmap Index Scan on test03_pkey (cost=0.00..4.43 rows=1 width=0) (actual time=0.067..0.067 rows=3 loops=1)

 Index Cond: (test03.id = 2)

 Buffers: shared hit=11

 Planning time: 0.101 ms

 Execution time: 0.148 ms

(10 rows)

使用pageinspect观察测试过程中索引页的内容变化

创建extension

postgres=# create extension pageinspect;

开启长事务

postgres=# begin;

BEGIN

postgres=# select txid_current();

测试60秒更新

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 60

观察需要扫描多少数据块

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;

 QUERY PLAN 

------------------------------------------------------------------------------------------------------------------------------

 Index Scan using test03_pkey on public.test03 (cost=0.43..8.45 rows=1 width=417) (actual time=0.052..15.738 rows=1 loops=1)

 Output: id, info

 Index Cond: (test03.id = 1)

 Buffers: shared hit=2663

 Planning time: 0.572 ms

 Execution time: 15.790 ms

(6 rows)

postgres=# set enable_indexscan=off;

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;

 QUERY PLAN 

-----------------------------------------------------------------------------------------------------------------------

 Bitmap Heap Scan on public.test03 (cost=4.44..8.45 rows=1 width=417) (actual time=6.138..6.139 rows=1 loops=1)

 Output: id, info

 Recheck Cond: (test03.id = 1)

 Heap Blocks: exact=2651

 Buffers: shared hit=2663

 - Bitmap Index Scan on test03_pkey (cost=0.00..4.44 rows=1 width=0) (actual time=0.585..0.585 rows=2651 loops=1)

 Index Cond: (test03.id = 1)

 Buffers: shared hit=12

 Planning time: 0.093 ms

 Execution time: 6.218 ms

(10 rows)

观察索引页, root=412, 层级=2

postgres=# select * from bt_metap(test03_pkey);

 magic | version | root | level | fastroot | fastlevel 

--------+---------+------+-------+----------+-----------

 340322 | 2 | 412 | 2 | 412 | 2

(1 row)

查看root页内容

postgres=# select * from bt_page_stats(test03_pkey,412);

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 

-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------

 412 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 2 | 2

(1 row)

postgres=# select * from bt_page_items(test03_pkey,412);

 itemoffset | ctid | itemlen | nulls | vars | data 

------------+---------+---------+-------+------+-------------------------

 1 | (3,1) | 8 | f | f | 

 2 | (584,1) | 16 | f | f | 21 00 00 00 00 00 00 00

 3 | (411,1) | 16 | f | f | 46 00 00 00 00 00 00 00

(3 rows)

查看最左branch 页内容

postgres=# select * from bt_page_items(test03_pkey,3);

 itemoffset | ctid | itemlen | nulls | vars | data 

------------+---------+---------+-------+------+-------------------------

 1 | (58,1) | 16 | f | f | 21 00 00 00 00 00 00 00

 2 | (1,1) | 8 | f | f | 

 3 | (937,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 4 | (767,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 5 | (666,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 6 | (572,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 7 | (478,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 8 | (395,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 9 | (307,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 10 | (173,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 11 | (99,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 12 | (951,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 13 | (867,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 14 | (773,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 15 | (660,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 16 | (564,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 17 | (496,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 18 | (413,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 19 | (319,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 20 | (204,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 21 | (151,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 22 | (64,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 23 | (865,1) | 16 | f | f | 03 00 00 00 00 00 00 00

 24 | (777,1) | 16 | f | f | 03 00 00 00 00 00 00 00

查看包含最小值的最左叶子节点内容

postgres=# select * from bt_page_items(test03_pkey,1);

 itemoffset | ctid | itemlen | nulls | vars | data 

------------+------------+---------+-------+------+-------------------------

 1 | (57342,14) | 16 | f | f | 01 00 00 00 00 00 00 00

 2 | (71195,14) | 16 | f | f | 01 00 00 00 00 00 00 00

 3 | (71171,12) | 16 | f | f | 01 00 00 00 00 00 00 00

 4 | (71185,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 5 | (71150,17) | 16 | f | f | 01 00 00 00 00 00 00 00

 6 | (71143,1) | 16 | f | f | 01 00 00 00 00 00 00 00

......

查看包含最小值的最右叶子节点内容

postgres=# select * from bt_page_items(test03_pkey,99);

 itemoffset | ctid | itemlen | nulls | vars | data 

------------+------------+---------+-------+------+-------------------------

 1 | (66214,10) | 16 | f | f | 02 00 00 00 00 00 00 00

 2 | (12047,15) | 16 | f | f | 01 00 00 00 00 00 00 00

......

 40 | (11052,15) | 16 | f | f | 01 00 00 00 00 00 00 00

 41 | (11009,6) | 16 | f | f | 01 00 00 00 00 00 00 00

 42 | (11021,6) | 16 | f | f | 01 00 00 00 00 00 00 00

 43 | (71209,3) | 16 | f | f | 02 00 00 00 00 00 00 00

 44 | (69951,1) | 16 | f | f | 02 00 00 00 00 00 00 00

查看这些叶子索引页包含data=01 00 00 00 00 00 00 00的item有多少条,可以对应到需要扫描多少heap page

select count(distinct substring(ctid::text, 1, "position"(ctid::text, ,))) from (

select * from bt_page_items(test03_pkey,1) 

union all

select * from bt_page_items(test03_pkey,937) 

union all

select * from bt_page_items(test03_pkey,767) 

union all

select * from bt_page_items(test03_pkey,666) 

union all

select * from bt_page_items(test03_pkey,572) 

union all

select * from bt_page_items(test03_pkey,478) 

union all

select * from bt_page_items(test03_pkey,395) 

union all

select * from bt_page_items(test03_pkey,307) 

union all

select * from bt_page_items(test03_pkey,173) 

union all

select * from bt_page_items(test03_pkey,99) 

union all

select * from bt_page_items(test03_pkey,951)

where data=01 00 00 00 00 00 00 00;

 count 

-------

 2652

(1 row)

2652与前面执行计划中看到的2651对应。

提交长事务

postgres=# end;

COMMIT

等待autovacuum结束

postgres=# select * from pg_stat_all_tables where relname=test03;

-[ RECORD 1 ]-------+------------------------------

relid | 14156713

schemaname | public

relname | test03

seq_scan | 39

seq_tup_read | 5137822

idx_scan | 3522865664

idx_tup_fetch | 3521843178

n_tup_ins | 1022487

n_tup_upd | 3476465702

n_tup_del | 22387

n_tup_hot_upd | 3433472972

n_live_tup | 100

n_dead_tup | 0

n_mod_since_analyze | 0

last_vacuum | 2016-07-15 00:03:53.909086+08

last_autovacuum | 2016-07-15 00:32:04.177672+08

last_analyze | 2016-07-15 00:03:53.909825+08

last_autoanalyze | 2016-07-15 00:07:23.541629+08

vacuum_count | 10

autovacuum_count | 125

analyze_count | 7

autoanalyze_count | 99

观察现在需要扫描多少块

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;

 QUERY PLAN 

---------------------------------------------------------------------------------------------------------------------

 Bitmap Heap Scan on public.test03 (cost=40.40..44.41 rows=1 width=417) (actual time=0.026..0.027 rows=1 loops=1)

 Output: id, info

 Recheck Cond: (test03.id = 1)

 Heap Blocks: exact=1

 Buffers: shared hit=5

 - Bitmap Index Scan on test03_pkey (cost=0.00..40.40 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=1)

 Index Cond: (test03.id = 1)

 Buffers: shared hit=4

 Planning time: 0.137 ms

 Execution time: 0.052 ms

(10 rows)

查看现在的索引页内容,half page已经remove掉了

postgres=# select count(distinct substring(ctid::text, 1, "position"(ctid::text, ,))) from (

select * from bt_page_items(test03_pkey,1) 

union all

select * from bt_page_items(test03_pkey,937) 

union all

select * from bt_page_items(test03_pkey,767) 

union all

select * from bt_page_items(test03_pkey,666) 

union all

select * from bt_page_items(test03_pkey,572) 

union all

select * from bt_page_items(test03_pkey,478) 

union all

select * from bt_page_items(test03_pkey,395) 

union all

select * from bt_page_items(test03_pkey,307) 

union all

select * from bt_page_items(test03_pkey,173) 

union all

select * from bt_page_items(test03_pkey,99) 

union all

select * from bt_page_items(test03_pkey,951)

where data=01 00 00 00 00 00 00 00 ;

NOTICE: page is deleted

NOTICE: page is deleted

NOTICE: page is deleted

NOTICE: page is deleted

NOTICE: page is deleted

NOTICE: page is deleted

NOTICE: page is deleted

NOTICE: page is deleted

NOTICE: page is deleted

-[ RECORD 1 ]

count | 2

再观察索引页内容,已经被autovacuum收缩了

postgres=# select * from bt_metap(test03_pkey);

 magic | version | root | level | fastroot | fastlevel 

--------+---------+------+-------+----------+-----------

 340322 | 2 | 412 | 2 | 412 | 2

(1 row)

postgres=# select * from bt_page_items(test03_pkey,412);

 itemoffset | ctid | itemlen | nulls | vars | data 

------------+---------+---------+-------+------+-------------------------

 1 | (3,1) | 8 | f | f | 

 2 | (584,1) | 16 | f | f | 21 00 00 00 00 00 00 00

 3 | (411,1) | 16 | f | f | 46 00 00 00 00 00 00 00

(3 rows)

postgres=# select * from bt_page_items(test03_pkey,3);

 itemoffset | ctid | itemlen | nulls | vars | data 

------------+---------+---------+-------+------+-------------------------

 1 | (58,1) | 16 | f | f | 21 00 00 00 00 00 00 00

 2 | (1,1) | 8 | f | f | 

 3 | (99,1) | 16 | f | f | 01 00 00 00 00 00 00 00

 4 | (865,1) | 16 | f | f | 02 00 00 00 00 00 00 00

 5 | (844,1) | 16 | f | f | 03 00 00 00 00 00 00 00

 6 | (849,1) | 16 | f | f | 04 00 00 00 00 00 00 00

 7 | (18,1) | 16 | f | f | 05 00 00 00 00 00 00 00

 8 | (95,1) | 16 | f | f | 06 00 00 00 00 00 00 00

 9 | (63,1) | 16 | f | f | 07 00 00 00 00 00 00 00

 10 | (34,1) | 16 | f | f | 08 00 00 00 00 00 00 00

 11 | (851,1) | 16 | f | f | 09 00 00 00 00 00 00 00

 12 | (10,1) | 16 | f | f | 0a 00 00 00 00 00 00 00

 13 | (71,1) | 16 | f | f | 0b 00 00 00 00 00 00 00

 14 | (774,1) | 16 | f | f | 0c 00 00 00 00 00 00 00

 15 | (213,1) | 16 | f | f | 0d 00 00 00 00 00 00 00

 16 | (881,1) | 16 | f | f | 0e 00 00 00 00 00 00 00

 17 | (837,1) | 16 | f | f | 0f 00 00 00 00 00 00 00

 18 | (100,1) | 16 | f | f | 10 00 00 00 00 00 00 00

 19 | (872,1) | 16 | f | f | 11 00 00 00 00 00 00 00

 20 | (32,1) | 16 | f | f | 12 00 00 00 00 00 00 00

 21 | (65,1) | 16 | f | f | 13 00 00 00 00 00 00 00

 22 | (870,1) | 16 | f | f | 14 00 00 00 00 00 00 00

 23 | (841,1) | 16 | f | f | 15 00 00 00 00 00 00 00

 24 | (850,1) | 16 | f | f | 16 00 00 00 00 00 00 00

 25 | (30,1) | 16 | f | f | 17 00 00 00 00 00 00 00

 26 | (91,1) | 16 | f | f | 18 00 00 00 00 00 00 00

 27 | (829,1) | 16 | f | f | 19 00 00 00 00 00 00 00

 28 | (16,1) | 16 | f | f | 1a 00 00 00 00 00 00 00

 29 | (784,1) | 16 | f | f | 1b 00 00 00 00 00 00 00

 30 | (31,1) | 16 | f | f | 1c 00 00 00 00 00 00 00

 31 | (88,1) | 16 | f | f | 1d 00 00 00 00 00 00 00

 32 | (48,1) | 16 | f | f | 1e 00 00 00 00 00 00 00

 33 | (822,1) | 16 | f | f | 1f 00 00 00 00 00 00 00

 34 | (817,1) | 16 | f | f | 20 00 00 00 00 00 00 00

 35 | (109,1) | 16 | f | f | 21 00 00 00 00 00 00 00

(35 rows)

postgres=# select * from bt_page_items(test03_pkey,1);

 itemoffset | ctid | itemlen | nulls | vars | data 

------------+------------+---------+-------+------+-------------------------

 1 | (57342,14) | 16 | f | f | 01 00 00 00 00 00 00 00

 2 | (71195,14) | 16 | f | f | 01 00 00 00 00 00 00 00

(2 rows)

postgres=# select * from bt_page_items(test03_pkey,99);

 itemoffset | ctid | itemlen | nulls | vars | data 

------------+------------+---------+-------+------+-------------------------

 1 | (66214,10) | 16 | f | f | 02 00 00 00 00 00 00 00

 2 | (71209,3) | 16 | f | f | 02 00 00 00 00 00 00 00

(2 rows)

src/backend/access/nbtree/nbtpage.c

/*

 * Unlink a page in a branch of half-dead pages from its siblings.

 * If the leaf page still has a downlink pointing to it, unlinks the highest

 * parent in the to-be-deleted branch instead of the leaf page. To get rid

 * of the whole branch, including the leaf page itself, iterate until the

 * leaf page is deleted.

 * Returns false if the page could not be unlinked (shouldnt happen).

 * If the (new) right sibling of the page is empty, *rightsib_empty is set

 * to true.

static bool

_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)

 * Mark the page itself deleted. It can be recycled when all current

 * transactions are gone. Storing GetTopTransactionId() would work, but

 * were in VACUUM and would not otherwise have an XID. Having already

 * updated links to the target, ReadNewTransactionId() suffices as an

 * upper bound. Any scan having retained a now-stale link is advertising

 * in its PGXACT an xmin less than or equal to the value we read here. It

 * will continue to do so, holding back RecentGlobalXmin, for the duration

 * of that scan.

 page = BufferGetPage(buf);

 opaque = (BTPageOpaque) PageGetSpecialPointer(page);

 opaque- btpo_flags = ~BTP_HALF_DEAD;

 opaque- btpo_flags |= BTP_DELETED;

 opaque- btpo.xact = ReadNewTransactionId();

...

contrib/pageinspect/btreefuncs.c

 if (P_ISDELETED(opaque))

 elog(NOTICE, "page is deleted");

1. b-tree原理
https://yq.aliyun.com/articles/54437

1. 频繁更新的表,数据库的优化手段
1.1 监控长事务,绝对控制长事务

1.2 缩小autovacuum naptime (to 1s) ,

 增加autovacuum work (to 10), 

 设置autovacuum delay=0, 

 增大autovacuum work memory (to 512MB or bigger), 

 将经常变更的表和索引放到好的iops的设备上 。 

 不要小看这几个参数,非常的关键。 

1.3 如果事务释放并且表上面已经出发了vacuum后,还是要查很多的PAGE,说明index page没有delete和收缩,可能是index page没有达到compact的要求,如果遇到这种情况,需要reindex。

2. PostgreSQL 9.6通过快照过旧彻底解决这个长事务引发的坑爹问题。
9.6 vacuum的改进如图
screenshot

如何判断snapshot too old如图
screenshot

https://www.postgresql.org/docs/9.6/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR

3. 9.6的垃圾回收机制也还有改进的空间,做到更细粒度的版本控制,改进方法以前分享过,在事务列表中增加记录事务隔离级别,通过隔离级别判断需要保留的版本,而不是简单的通过最老事务来判断需要保留的垃圾版本。

祝大家玩得开心,欢迎随时来 阿里云促膝长谈 业务需求 ,恭候光临。

阿里云的小伙伴们加油,努力做 最贴地气的云数据库 。


PostgreSQL 垃圾回收参数优化之 - maintenance_work_mem , autovacuum_work_mem PostgreSQL , 垃圾回收 , 索引扫描 , 内存 夜谈PostgreSQL 垃圾回收参数优化之 - maintenance_work_mem , autovacuum_work_mem。 http://www.postgres.cn/v2/news/viewone/1/398 https://rhaas.blogspot.com/2019/01/how-much
PostgreSQL物理 备库 的哪些操作或配置,可能影响 主库 的性能、垃圾回收、IO波动 PostgreSQL , 物理复制 , 垃圾回收 , vacuum_defer_cleanup_age , hot_standby_feedback , max_standby_archive_delay , max_standby_streaming_delay PostgreSQL 物理备库的哪些配置,或者哪些操作,可能影响到主库呢? 首先,简单介绍一下PostgreSQL的物理备库,物理备库就是基于PostgreSQL WAL流式复制,实时恢复的备库。
阿里云数据库专家于巍荣获PostgreSQL中国技术大会“最具价值专家 MVP”奖 2023年3月3日,在由PostgreSQL中文社区主办的“第十二届PostgreSQL中国技术大会”上,阿里云数据库开源首席架构师于巍荣获“中国 PostgreSQL 最具价值专家 MVP”奖项。
喜讯!阿里云数据库PolarDB荣获第12届PostgreSQL中国技术大会“开源数据库杰出贡献奖” 2023年3月3日,由PostgreSQL中文社区主办的“第十二届PostgreSQL中国技术大会”在杭州隆重开幕。本次大会以“突破·进化·共赢 —— 安全可靠,共建与机遇”为主题,邀请了来自国内外的行业专家、企业家代表等,共见创新成果、共论国产数据库发展机遇。会上,PolarDB for PostgreSQL(简称PolarDB-PG )凭借在数据库开源领域的深耕布局和产品技术实力,荣获“开源数据库杰出贡献奖”。
《RDS数据库入门一本通》电子版地址 本书体系完整,理论充实,内容由浅入深,循序渐进,更配有详细的实例解说,为初学者提供一个完整、严密的思维框架,零基础的你也可轻松掌握RDS数据库的美妙节奏。