lightdb/postgresql toast解析&页面物理组织
先postgresql的页结构、页物理布局
源码定义如下:
struct HeapTupleHeaderData { union { HeapTupleFields t_heap; DatumTupleFields t_datum; }t_choice; ItemPointerData t_ctid;// uint16 t_infomask2; /* number of attributes + various flags */ uint16 t_infomask; /* various flag bits, see below */ uint8 t_hoff; /* sizeof header incl. bitmap, padding */ bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */ };
铺开来看的话:
Field | Type | Length | Description |
---|---|---|---|
t_xmin | TransactionId | 4 bytes | insert XID stamp |
t_xmax | TransactionId | 4 bytes | delete XID stamp |
t_cid | CommandId | 4 bytes | insert and/or delete CID stamp (overlays with t_xvac) |
t_xvac | TransactionId | 4 bytes | XID for VACUUM operation moving a row version |
t_ctid | ItemPointerData | 6 bytes | current TID of this or newer row version |
t_infomask2 | uint16 | 2 bytes | number of attributes, plus various flag bits |
t_infomask | uint16 | 2 bytes | various flag bits |
t_hoff | uint8 | 1 byte | offset to user data |
所有字段的详情定义在src/include/access/htup_details.h中(这里面的信息对开发非常重要)。
TOAST块在shared_buffer中也是压缩状态,只有在访问时才会被解压。
es_test=# \d+ big_search_doc_new_ic Table "public.big_search_doc_new_ic" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------------------+-------------------------+-----------+----------+---------+----------+--------------+------------- summary | character varying(1024) | | | | extended | | sentiment | character varying(100) | | | | extended | | industryname | character varying(100) | | | | extended | | secucode | character varying(16) | | | | extended | | filename | character varying(100) | | | | extended | | industrycode | character varying(16) | | | | extended | | format | character varying(100) | | | | extended | | reporttype | character varying(16) | | | | extended | | fstoregroup | character varying(16) | | | | extended | | title | character varying(100) | | | | extended | | docid | character varying(36) | | | | extended | | content | text | | | | extended | | industrychains | character varying[] | | | | extended | | investranking | character varying(16) | | | | extended | | industrychainnodes | character varying[] | | | | extended | | docsource | character varying(16) | | | | extended | | secuname | character varying(16) | | | | extended | | fstorepath | character varying(64) | | | | extended | | riskranking | character varying(16) | | | | extended | | orgname | character varying(32) | | | | extended | | publishdate | character varying(32) | | | | extended | | pagecount | integer | | | | plain | | authors | character varying[] | | | | extended | | Access method: heap
es_test=# select oid,relname,reltoastrelid,relpages,reltuples from pg_class where relname='big_search_doc_new_ic'; oid | relname | reltoastrelid | relpages | reltuples --------+-----------------------+---------------+----------+-------------- 335878 | big_search_doc_new_ic | 335881 | 618489 | 3.210117e+06 es_test=# select oid,relname,reltoastrelid,relpages,reltuples from pg_class where oid=335881; -- TOAST比表大是正常的 oid | relname | reltoastrelid | relpages | reltuples --------+-----------------+---------------+----------+-------------- 335881 | pg_toast_335878 | 0 | 18619610 | 6.538884e+07
select a.oid, a.relname, a.reltoastrelid, a.relpages, a.reltuples,
a.relfilenode, b.oid, b.relname, b.reltoastrelid, b.relpages,
b.relfilenode, b.reltuples from pg_class a, pg_class b where a.relname = 'data' and a.reltoastrelid = b.oid;
es_test=# select * from pg_toast_335878 limit 1; # TOAST表不能直接访问。 ERROR: relation "pg_toast_335878" does not exist LINE 1: select * from pg_toast_335878 limit 1; ^
es_test=# SELECT tuple_data_split('big_search_doc_new_ic'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('big_search_doc_new_ic', 0)) limit 1; tuple_data_split ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------- {"\\x840b0000e4bcb4e99a8fe79d80e4b8ade59bbde4bc81e4b89ae695b0e5ad97e58c96e58d87e7baa7e79a84e5a4a7e8b68be58abfefbc8c5361615320e8a18ce4b89ae5b086e698afe995bfe69c9fe58f97e79b8ae79a84e4bc98e8b4a 8e8b59be98193e38082e4b88ee7be8ee59bbd205361615320e8a18ce4b89ae79a84e7bea4e6989fe997aae88080e79bb8e6af94efbc8ce4b8ade59bbd205361615320e5b882e59cbae8bf98e5a484e4ba8ee58f91e5b195e697a9e69c9fefbc 8ce4bd86e99a8fe79d80e68891e59bbde4ba91e8aea1e7ae97e59fbae7a180e8aebee696bde98090e6b890e68890e7869fefbc8ce4b8ade59bbd2053616153e4b99fe5b086e8bf8ee69da5e799bee88ab1e9bd90e694bee38082e79bb8e5afb 9e4ba8ee2809ce88a82e6b581e59e8be2809d53616153efbc8ce68891e4bbace69bb4e79c8be5a5bde2809ce5bc80e6ba90e59e8be2809d5361615320e79a84e58f91e5b195e380825361615320e8a18ce4b89ae794b1e9809ae794a8e7b1bb e98090e6b890e59091e59e82e79bb4e8a18ce4b89ae7b1bbe6b897e9808fefbc8ce69bb4e5a49ae7bb86e58886e8a18ce4b89ae58685e68896e5b086e6b68ce78eb0e9be99e5a4b4e38082e68891e4bbace9a696e6aca1e8a686e79b96e4b8a de59bbd205361615320e8a18ce4b89ae5928ce585ade5aeb6e4bc98e8b4a8e585ace58fb8e6a087e79a84efbc8ce59d87e7bb99e4ba88e2809ce4b9b0e585a5e2809de8af84e7baa7efbc8ce68ea8e88d90e9a1bae5ba8fe4be9de6aca1e4b8 bae4b8ade59bbde69c89e8b59eefbc88383038332e484befbc89e38081e5beaee79b9fe99b86e59ba2efbc88323031332e484befbc89e38081e98791e5b1b1e8bdafe4bbb6efbc88333838382e484befbc89e38081e794a8e58f8be7bd91e7b b9cefbc883630303538382e5348efbc89e38081e98791e89db6e59bbde99985efbc883236382e484befbc89e5928ce6988ee6ba90e4ba91efbc883930392e484befbc89e380820a",NULL,NULL,NULL,"\\x49e4b8ade59bbd53616153e8a18 ce4b89aefbc9ae9a38ee887b3e4ba91e8b5b72e706466",NULL,NULL,"\\x1be585b6e4bb96e7a094e68aa5","\\x0f67726f757031","\\x41e4b8ade59bbd53616153e8a18ce4b89aefbc9ae9a38ee887b3e4ba91e8b5b7","\\x43666232 6564633232643931623462303439343466636464366137393833666631","\\x01123d470500cd1402000e20050009200500","\\x1b000000000000000013040000","\\x0fe8b685e9858d","\\x1b000000000000000013040000","\\x0 f6673746f7265","\\x03","\\x5b4d30302f30302f30302f4368516775315f354f3265414c5161484146354e72562d5a4639343031352e706466",NULL,"\\x27e6b5a6e993b6e59bbde99985e8af81e588b8","\\x33323032302d31322d3 2315430303a30303a30302b30383030","\\x88000000","\\x7b010000000000000013040000030000000100000034000000e69da8e5ad90e8b68500000028000000e69e97e790b0000028000000e8b5b5e4b8b90000","\\x0112564a0100 524a01000d20050009200500"} (1 row)
https://www.postgresql.org/docs/current/pageinspect.html#id-1.11.7.31.5
http://www.mytecdb.com/blogDetail.php?id=223
TOAST字段(可变长度类型字段,含varchar、bytea、jsonb等,不一定非得超过2000字节)的存储选项
默认情况下,当字段超过2KB的时候TOAST就会尝试压缩宽列,如果压缩之后仍然超过2KB,就会将宽字段分块,源表通过指针指向分块后的TOAST,如果低于2KB,则直接存储。所以有些宽列压缩后存储在行内,有些在TOAST是可能的。
ALTER TABLE ... SET (toast_tuple_target = N) 默认为2KB,支持表级别设置。
ALTER [ COLUMN ] ALTER TABLE ...
column_name
SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
-
PLAIN
prevents either compression or out-of-line storage; furthermore it disables use of single-byte headers for varlena types. This is the only possible strategy for columns of non-TOAST-able data types. -
EXTENDED
allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big. -
EXTERNAL
allows out-of-line storage but not compression. Use ofEXTERNAL
will make substring operations on widetext
andbytea
columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed. -
MAIN
allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.
alter table data add column big_data_plain text; alter table data alter column big_data_plain set storage plain; ## plain需要注意最大长度不能超过8160,否则会报错:
zjh@postgres=# alter table t alter column v set storage plain; ALTER TABLE zjh@postgres=# zjh@postgres=# zjh@postgres=# begin; BEGIN zjh@postgres=*# select pg_backend_pid(); pg_backend_pid ---------------- 175641 (1 row) zjh@postgres=*# insert into t values(rpad('y',1000000,'y')); ERROR: row is too big: size 1000032, maximum size 8160 在raw_heap_insert函数中判断
alter table data add column big_data_external text; alter table data alter column big_data_external set storage external; alter table data add column big_data_main text; alter table data alter column big_data_main set storage main;
update data set big_data_plain = big_column,big_data_external = big_column,big_data_main = big_column;
analyze data;
可通过select attname,attstorage from pg_catalog.pg_attribute where attrelid = 24594;查询列存储格式,可知已经是指定格式。
id |p | data |x | big_column |x | big_data_plain |p | big_data_external|e | big_data_main |m |
但是查询toast关系大小,仍然是0页?难道在主表中?
oid |relname|reltoastrelid|relpages|reltuples|relfilenode|oid |relname |reltoastrelid|relfilenode|relpages|reltuples| -----+-------+-------------+--------+---------+-----------+-----+--------------+-------------+-----------+--------+---------+ 24594|data | 24598| 1| 16.0| 57388|24598|pg_toast_24594| 0| 57389| 0| 0.0|
查看对应的57389文件,大小不为空(应该算是缺陷)。
TOAST压缩
在PG中,针对TOAST,默认使用的是LZ压缩算法,实现为PGLZ。压缩得通常用法是在写磁盘和网络之前进行压缩,在应用接收和读取之后解压。
https://medium.com/@lk.snatch/postgresql-compression-854a4647ee43
pg 14为什么要增加LZ4算法选择
论压缩率,LZ4和PGLZ差异是不大的,但是LZ4速度相比PGLZ要更快,所以这在没有结果集缓存时非常重要,因为每次访问都要重复解压,单笔体现不出来。
https://stackoverflow.com/questions/67537111/how-do-i-decide-between-lz4-and-snappy-compression
postgres=# postgres=# insert into tab_compression_1 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 1274.374 ms (00:01.274) postgres=# postgres=# postgres=# postgres=# insert into tab_compression_1 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 1264.036 ms (00:01.264) postgres=# postgres=# postgres=# postgres=# insert into tab_compression_1 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 1270.027 ms (00:01.270) postgres=# postgres=# postgres=# insert into tab_compression_2 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 178.854 ms postgres=# insert into tab_compression_2 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 178.418 ms postgres=# insert into tab_compression_2 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 179.050 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_1 ; x --- 1 (1 row) Time: 140.865 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_1 ; x --- 1 (1 row) Time: 136.520 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_1 ; x --- 1 (1 row) Time: 136.078 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_2 ; x --- 1 (1 row) Time: 109.513 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_2 ; x --- 1 (1 row) Time: 110.300 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_2 ; x --- 1 (1 row) Time: 111.693 ms
查询时间降低了20%。插入时间降低了86%。
那为什么默认不使用lz4呢?因为升级问题,如果直接从13及之前版本升级上来,数据就会破坏。
最后
最后,还需要一提的是,postgresql还支持真正的大对象large object,并且包含对应的API和数据字典,虽然现在一般都用TOAST代替了,但是其比TOAST更加灵活、当然也更加原始,可参见https://www.hs.net/lightdb/docs/html/largeobjects.html。
java保存图片到postgresql:https://www.enterprisedb.com/postgres-tutorials/postgresql-toast-and-working-blobsclobs-explained
现代LZ算法
https://glinscott.github.io/lz/index.html
参考
相关文章
- Why is VMware Not Compatible With Hyper-V & How to Fix It (windowsreport.com)
- kaniko & kubernetes 构建镜像
- 【前端学习之HTML&CSS】-- CSS第九篇 -- 多种多样的选择器
- 【性能测试】Jmeter工具使用3-练习题&详细答案
- C语言位域解析&符号位扩展规则
- 使用 useReducer 和 useCallback 解决 useEffect 依赖诚实与方法内置&外置问题
- VMware创建Linux虚拟机之(四)ZooKeeper&HBase完全分布式安装
- sequelize Getters, Setters & Virtuals - 获取器, 设置器 & 虚拟字段
- C&C++数组实训(国防科大)
- spin_lock & mutex_lock的差别?
- sql server修改链接服务器 Rpc &Rpc Out
- 学习笔记(35续):Python网络编程&并发编程-基于gevent及线程池实现的并发套接字通讯