zl程序教程

您现在的位置是:首页 >  其他

当前栏目

海量数据,海明距离高效检索(smlar) - 阿里云RDS PosgreSQL最佳实践

阿里数据 实践 高效 最佳 距离 检索 海量
2023-09-11 14:21:46 时间
背景

http://www.cnblogs.com/lushilin/p/6549665.html

SimHash的应用

通过上面的步骤,我们可以利用SimHash算法为每一个网页生成一个向量指纹,那么问题来了,如何判断2篇文本的相似性?
这里面主要应用到是海明距离。

(1)什么是海明距离
两个码字的对应比特取值不同的比特数称为这两个码字的海明距离。在一个有效编码集中,任意两个码字的海明距离的最小值称为该编码集的海明距离。举例如下:10101和00110从第一位开始依次有第一位、第四、第五位不同,则海明距离为3。

(2)海明距离的几何意义
n位的码字可以用n维空间的超立方体的一个顶点来表示。两个码字之间的海明距离就是超立方体两个顶点之间的一条边,而且是这两个顶点之间的最短距离。

(3)海明距离的应用场景
用于编码的检错和纠错

经过SimHash算法提取来的指纹(Simhash对长文本500字+比较适用,短文本可能偏差较大,具体需要根据实际场景测试),最后使用海明距离,求相似,在google的论文给出的数据中,64位的签名,在海明距离为3的情况下,可认为两篇文档是相似的或者是重复的,当然这个值只是参考值,针对自己的应用可能又不同的测试取值

到这里相似度问题基本解决,但是按这个思路,在海量数据几百亿的数量下,效率问题还是没有解决的,因为数据是不断添加进来的,不可能每来一条数据,都要和全库的数据做一次比较,按照这种思路,处理速度会越来越慢,线性增长。

针对海量数据的去重效率,我们可以将64位指纹,切分为4份16位的数据块,根据抽屉原理在海明距离为3的情况,如果两个文档相似,那么它必有一个块的数据是相等的。

那么数据库是否支持这种高效率的检索呢?

反正PostgreSQL是支持的,通过黑科技smlar插件。

一、需求 二、架构设计

在PostgreSQL中,从海量数据中,搜索海明距离小于N的数据,有多种设计手段。每种方法的能耗比都不一样,读者可以按需选择。

1 暴力计算

1、单机多核并行计算,暴力扫描。采用阿里云RDS PostgreSQL 10提供的多核并行能力,暴力扫描。

2、多机多核并行计算,暴力扫描。采用阿里云HybridDB for PostgreSQL提供的多级并行计算能力,暴力扫描。

3、利用GPU、FPGA加速暴力运算。

PostgreSQL提供了扩展接口,可以利用GPU,FPGA的能力对数据进行计算。

4、利用CPU向量计算指令,暴力计算。

PostgreSQL提供了扩展接口,可以利用CPU向量计算指令的能力加速计算。

2 索引

索引是高效的做法,例如PostgreSQL smlar插件,在阿里的导购平台就有使用,用于实时导购文的海量相似度查询。

如果要让smlar加速海明距离的搜索,需要采用更科学的方法,比如切片。

直接使用位置,会有问题,因为smlar的第一道工序是块级收敛,而海明码是bit64的编码,在一个数据块中,有若干条记录,任何位置都可能同时出现0和1,任何数据块都包含0和1,因此无法完成第一道过滤。

我们可以采用切片,减少这种可能性。例如每2个BIT一片,或者每4个BIT一片,或者更多。

通常海明距离大于3的,就没有什么相关性了。

三、DEMO与性能 1 暴力计算

1、全扫,并行扫描

创建测试表

create table hm ( 

 id int, -- id 

 hmval bit(64) -- 海明HASH 

写入1000万测试数据

postgres=# insert into hm select id, val::int8::bit(64) from (select id, sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric as val from generate_series(1,10000000) t(id)) t; 

INSERT 0 10000000 

postgres=# select * from hm limit 10; 

 id | hmval 

----+------------------------------------------------------------------ 

 1 | 0000101001110110110101010111101011100110101010000111100011110111 

 2 | 0110011100110101101000001010101111010001011101100111111011001110 

 3 | 1010110111001011011110110000111111101101101111010111111100101110 

 4 | 0110011110110000001011000010010000101011100101010100111000101001 

 5 | 0101110100101111010110010110000000101110000010001011010110110000 

 6 | 0011010000100000101011011100000101111110010110111101100001100001 

 7 | 1011110011101101101000011101011101010111011001011010110111101000 

 8 | 1110010011000101001101110010001111110100001101010101111101110010 

 9 | 0110111111110011101001001000101101011011111100010010111010001111 

 10 | 0011100011000010101011010001111000000110100011100100111011011001 

(10 rows) 

设置暴力并行度

postgres=# set force_parallel_mode = on; 

postgres=# set min_parallel_table_scan_size = 0; 

postgres=# set parallel_setup_cost = 0; 

postgres=# set parallel_tuple_cost = 0; 

postgres=# alter table hm set (parallel_workers = 128); 

postgres=# set max_parallel_workers_per_gather = 64; 

并行查询海明距离小于4的记录,耗时463毫秒。

postgres=# select * from hm where length(replace(bitxor(bit0011110001011010110010001011010101001000111110000111110010010110, hmval)::text,0,)) 

 id | hmval 

----+------------------------------------------------------------------ 

 16 | 0011110001011010110010001011010101001000111110000111110010010110 

(1 row) 

Time: 463.314 ms 

非并行查询海明距离小于4的记录,耗时16秒。

postgres=# select * from hm where length(replace(bitxor(bit0011110001011010110010001011010101001000111110000111110010010110, hmval)::text,0,)) 

 id | hmval 

----+------------------------------------------------------------------ 

 16 | 0011110001011010110010001011010101001000111110000111110010010110 

(1 row) 

Time: 16791.215 ms (00:16.791) 

求两个BIT的不同位数,还有更高效率的方法。理论上可以达到100毫秒以内。

https://www.postgresql.org/message-id/flat/ab1ea6540903121110l2a3021d4h6632b206e2419898%40mail.gmail.com#ab1ea6540903121110l2a3021d4h6632b206e2419898@mail.gmail.com 

2 索引

阿里云RDS PostgreSQL提供了一个smlar插件,用于高效率的求数组的相似度。

我们需要将海明HASH,转换为数组,根据前面的设计,我们采用8个BIT一片的切法,支持索引查询海明距离为8以内的值。

切之前,验证一下切片后的过滤性:

postgres=# select relpages from pg_class where relname=hm; 

 relpages 

---------- 

 63695 

(1 row) 

1、单个片为1时,不用说,每个块都包含。 

postgres=# select count(*) from (select substring(ctid::text,(\d+),) from hm where substring(hmval,1,1)=0 group by 1)t; 

 count 

------- 

 63695 

(1 row) 

2、单个片为8时,有接近一半的块包含。 

postgres=# select count(*) from (select substring(ctid::text,(\d+),) from hm where substring(hmval,1,8)=00000000 group by 1)t; 

 count 

------- 

 29100 

(1 row) 

3、单个片为16时,只有100多个块包含了。 

postgres=# select count(*) from (select substring(ctid::text,(\d+),) from hm where substring(hmval,1,16)=0000000000000000 group by 1)t; 

 count 

------- 

 160 

(1 row) 

8片切法的性能验证

创建插件

create extension smlar; 

创建测试表

create table hm1 (id int, hmval bit(64), hmarr text[]); 

生成1000万测试数据,生成测试数据时,按切分手段进行切分,记录为TEXT数组。

insert into hm1 

select 

 id, 

 val::bit(64), 

 regexp_split_to_array(1_||substring(val,1,8)||,2_||substring(val,9,8)||,3_||substring(val,17,8)||,4_||substring(val,25,8)||,5_||substring(val,33,8)||,6_||substring(val,41,8)||,7_||substring(val,49,8)||,8_||substring(val,57,8), ,) 

from 

(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t; 

postgres=# select * from hm1 limit 10; 

 id | hmval | hmarr 

----+------------------------------------------------------------------+------------------------------------------------------------------------------------------- 

 1 | 0000001110101101100110011000100111100100001100100101101010010011 | {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011} 

 2 | 0001001000010101001100100010101010111001001000000110101101100100 | {1_00010010,2_00010101,3_00110010,4_00101010,5_10111001,6_00100000,7_01101011,8_01100100} 

 3 | 0011111111010100011001001010110110100010101110101001101111010000 | {1_00111111,2_11010100,3_01100100,4_10101101,5_10100010,6_10111010,7_10011011,8_11010000} 

 4 | 1100110010011001001110101110111111111111010000100000010011000010 | {1_11001100,2_10011001,3_00111010,4_11101111,5_11111111,6_01000010,7_00000100,8_11000010} 

 5 | 0011000011010001011111010101010111100110000110000011101100000101 | {1_00110000,2_11010001,3_01111101,4_01010101,5_11100110,6_00011000,7_00111011,8_00000101} 

 6 | 0111101101111110101000010110101101110011011110100100010111011001 | {1_01111011,2_01111110,3_10100001,4_01101011,5_01110011,6_01111010,7_01000101,8_11011001} 

 7 | 0010001011111111100010101011110001001101001011100100011000010000 | {1_00100010,2_11111111,3_10001010,4_10111100,5_01001101,6_00101110,7_01000110,8_00010000} 

 8 | 1110001111100011011110110111101111010101000111000100111111111101 | {1_11100011,2_11100011,3_01111011,4_01111011,5_11010101,6_00011100,7_01001111,8_11111101} 

 9 | 0111110010111000010111001000000101111000000110110110000011101110 | {1_01111100,2_10111000,3_01011100,4_10000001,5_01111000,6_00011011,7_01100000,8_11101110} 

 10 | 0111001101100010001101101111000000100100000000010001010011100101 | {1_01110011,2_01100010,3_00110110,4_11110000,5_00100100,6_00000001,7_00010100,8_11100101} 

(10 rows) 

创建smlar索引

postgres=# create index idx_hm1 on hm1 using gin(hmarr _text_sml_ops ); 

搜索海明距离小于等于1的VALUE。用到了smlar索引,耗时63毫秒。

postgres=# set smlar.type = overlap; 

postgres=# set smlar.threshold = 7; 

select 

 smlar( hmarr, {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}) 

 from 

 hm1 

 where 

 hmarr % {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011} 

 and length(replace(bitxor(bit0000001110101101100110011000100111100100001100100101101010010011, hmval)::text,0,)) 2 

 limit 100; 


smlar( hmarr, {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}) from hm1 where hmarr % {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011} and length(replace(bitxor(bit0000001110101101100110011000100111100100001100100101101010010011, hmval)::text,0,)) 2 limit 100; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=117.83..420.48 rows=100 width=169) (actual time=62.928..62.929 rows=1 loops=1) Output: id, hmval, hmarr, (smlar(hmarr, {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}::text[])) Buffers: shared hit=166 - Bitmap Heap Scan on public.hm1 (cost=117.83..10205.17 rows=3333 width=169) (actual time=62.927..62.927 rows=1 loops=1) Output: id, hmval, hmarr, smlar(hmarr, {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}::text[]) Recheck Cond: (hm1.hmarr % {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}::text[]) Filter: (length(replace((bitxor(B0000001110101101100110011000100111100100001100100101101010010011::"bit", hm1.hmval))::text, 0::text, ::text)) 2) Heap Blocks: exact=1 Buffers: shared hit=166 - Bitmap Index Scan on idx_hm1 (cost=0.00..117.00 rows=10000 width=0) (actual time=62.898..62.898 rows=1 loops=1) Index Cond: (hm1.hmarr % {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}::text[]) Buffers: shared hit=165 Planning time: 0.147 ms Execution time: 62.975 ms (14 rows) postgres=# select smlar( hmarr, {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}) from hm1 where hmarr % {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011} and length(replace(bitxor(bit0000001110101101100110011000100111100100001100100101101010010011, hmval)::text,0,)) 2 limit 100; id | hmval | hmarr | smlar ----+------------------------------------------------------------------+-------------------------------------------------------------------------------------------+------- 1 | 0000001110101101100110011000100111100100001100100101101010010011 | {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011} | 8 (1 row) Time: 61.227 ms

如果我们只需要查询4以内的海明距离,实际上可以使用16的分组,或者我们可以使用混合切法。

6片混合切法的性能验证

切法为8,16,8,8,16,8。支持海明距离6以内的查询。

create table hm2 (id int, hmval bit(64), hmarr text[]); 

insert into hm2 

select 

 id, 

 val::bit(64), 

 regexp_split_to_array(1_||substring(val,1,8)||,2_||substring(val,9,16)||,3_||substring(val,25,8)||,4_||substring(val,33,8)||,5_||substring(val,41,16)||,6_||substring(val,57,8), ,) 

from 

(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t; 

postgres=# select * from hm2 limit 10; 

 id | hmval | hmarr 

----+------------------------------------------------------------------+------------------------------------------------------------------------------------- 

 1 | 1100111011000001100100100111111110100011100111111101101001101010 | {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010} 

 2 | 0111111000101011000111010011011000000010010001111001000111011101 | {1_01111110,2_0010101100011101,3_00110110,4_00000010,5_0100011110010001,6_11011101} 

 3 | 0111111000101111000101011100100000001111011101101100110100000101 | {1_01111110,2_0010111100010101,3_11001000,4_00001111,5_0111011011001101,6_00000101} 

 4 | 0111010101010010100000110001100011110010111000001011000010010010 | {1_01110101,2_0101001010000011,3_00011000,4_11110010,5_1110000010110000,6_10010010} 

 5 | 1111101100110100101111000011001011111110111000100110101001100001 | {1_11111011,2_0011010010111100,3_00110010,4_11111110,5_1110001001101010,6_01100001} 

 6 | 0011110000100010101001000001100010000010111011100010011001000110 | {1_00111100,2_0010001010100100,3_00011000,4_10000010,5_1110111000100110,6_01000110} 

 7 | 0000111111001110100110011110000110001101110111111111111010111001 | {1_00001111,2_1100111010011001,3_11100001,4_10001101,5_1101111111111110,6_10111001} 

 8 | 0110100010010100111100110110000011101110101001001111010101011111 | {1_01101000,2_1001010011110011,3_01100000,4_11101110,5_1010010011110101,6_01011111} 

 9 | 0111001111001100101011001001100100000000111100000110110001000011 | {1_01110011,2_1100110010101100,3_10011001,4_00000000,5_1111000001101100,6_01000011} 

 10 | 1101111101011000111100101010101000100001101100101110100001111000 | {1_11011111,2_0101100011110010,3_10101010,4_00100001,5_1011001011101000,6_01111000} 

(10 rows) 

create index idx_hm2 on hm2 using gin(hmarr _text_sml_ops ); 

查询海明距离小于等于1的值,提高到2毫秒了。

postgres=# set smlar.type = overlap; 

postgres=# set smlar.threshold = 5; 

postgres=# select 

 smlar( hmarr, {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}) 

 from 

 hm2 

 where 

 hmarr % {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010} 

 and length(replace(bitxor(bit1100111011000001100100100111111110100011100111111101101001101010, hmval)::text,0,)) 2 

 limit 100; 

 id | hmval | hmarr | smlar 

----+------------------------------------------------------------------+-------------------------------------------------------------------------------------+------- 

 1 | 1100111011000001100100100111111110100011100111111101101001101010 | {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010} | 6 

(1 row) 

Time: 1.954 ms 

postgres=# explain (analyze,verbose,timing,costs,buffers) select 

 smlar( hmarr, {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}) 

 from 

 hm2 

 where 

 hmarr % {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010} 

 and length(replace(bitxor(bit1100111011000001100100100111111110100011100111111101101001101010, hmval)::text,0,)) 2 

 limit 100; 

 QUERY PLAN 

------------------------------------------------------------------------------------------------------------------------------------------------------------------- 

 Limit (cost=103.83..406.06 rows=100 width=153) (actual time=2.414..2.416 rows=1 loops=1) 

 Output: id, hmval, hmarr, (smlar(hmarr, {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}::text[])) 

 Buffers: shared hit=102 

 - Bitmap Heap Scan on public.hm2 (cost=103.83..10177.17 rows=3333 width=153) (actual time=2.414..2.415 rows=1 loops=1) 

 Output: id, hmval, hmarr, smlar(hmarr, {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}::text[]) 

 Recheck Cond: (hm2.hmarr % {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}::text[]) 

 Filter: (length(replace((bitxor(B1100111011000001100100100111111110100011100111111101101001101010::"bit", hm2.hmval))::text, 0::text, ::text)) 2) 

 Heap Blocks: exact=1 

 Buffers: shared hit=102 

 - Bitmap Index Scan on idx_hm2 (cost=0.00..103.00 rows=10000 width=0) (actual time=2.374..2.374 rows=1 loops=1) 

 Index Cond: (hm2.hmarr % {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}::text[]) 

 Buffers: shared hit=101 

 Planning time: 0.149 ms 

 Execution time: 2.463 ms 

(14 rows) 

4片切法的性能验证
create table hm3 (id int, hmval bit(64), hmarr text[]); 

insert into hm3 

select 

 id, 

 val::bit(64), 

 regexp_split_to_array(1_||substring(val,1,16)||,2_||substring(val,17,16)||,3_||substring(val,33,16)||,4_||substring(val,41,16), ,) 

from 

(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t; 

postgres=# select * from hm3 limit 10; 

 id | hmval | hmarr 

----+------------------------------------------------------------------+------------------------------------------------------------------------------- 

 1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} 

 2 | 1101011000010000000000000000111011011111011101110100000010101011 | {1_1101011000010000,2_0000000000001110,3_1101111101110111,4_0111011101000000} 

 3 | 0101000010110110110010001010100010101001001010111111011000110011 | {1_0101000010110110,2_1100100010101000,3_1010100100101011,4_0010101111110110} 

 4 | 0111000111100011111000100111000011101111110000011110101101000100 | {1_0111000111100011,2_1110001001110000,3_1110111111000001,4_1100000111101011} 

 5 | 0010111010101011111010011110110010011110111111110011101110010011 | {1_0010111010101011,2_1110100111101100,3_1001111011111111,4_1111111100111011} 

 6 | 0110111110011100100110010111010000000011100011000011110001010110 | {1_0110111110011100,2_1001100101110100,3_0000001110001100,4_1000110000111100} 

 7 | 0100110100111001110011011110100111101110101001000101010110110110 | {1_0100110100111001,2_1100110111101001,3_1110111010100100,4_1010010001010101} 

 8 | 0110010111001100111000011011011100001100111111101111011010100010 | {1_0110010111001100,2_1110000110110111,3_0000110011111110,4_1111111011110110} 

 9 | 0110111010110000001010101111000101110000010011100011100101000100 | {1_0110111010110000,2_0010101011110001,3_0111000001001110,4_0100111000111001} 

 10 | 0101101000000110100101100011111111000101110001010011100110101011 | {1_0101101000000110,2_1001011000111111,3_1100010111000101,4_1100010100111001} 

(10 rows) 

create index idx_hm3 on hm3 using gin(hmarr _text_sml_ops ); 

查询海明距离小于等于1的值,提高到0.2毫秒了。

postgres=# set smlar.type = overlap; 

postgres=# set smlar.threshold = 3; 

postgres=# select 

 smlar( hmarr, {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}) 

 from 

 hm3 

 where 

 hmarr % {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} 

 and length(replace(bitxor(bit0101011111111010000001001011101101100011111101111101101100000011, hmval)::text,0,)) 2 

 limit 100; 

 id | hmval | hmarr | smlar 

----+------------------------------------------------------------------+-------------------------------------------------------------------------------+------- 

 1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} | 4 

(1 row) 


smlar( hmarr, {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}) from hm3 where hmarr % {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} and length(replace(bitxor(bit0101011111111010000001001011101101100011111101111101101100000011, hmval)::text,0,)) 2 limit 100; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=99.83..401.19 rows=100 width=134) (actual time=0.169..0.170 rows=1 loops=1) Output: id, hmval, hmarr, (smlar(hmarr, {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}::text[])) Buffers: shared hit=14 - Bitmap Heap Scan on public.hm3 (cost=99.83..10144.17 rows=3333 width=134) (actual time=0.168..0.169 rows=1 loops=1) Output: id, hmval, hmarr, smlar(hmarr, {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}::text[]) Recheck Cond: (hm3.hmarr % {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}::text[]) Filter: (length(replace((bitxor(B0101011111111010000001001011101101100011111101111101101100000011::"bit", hm3.hmval))::text, 0::text, ::text)) 2) Heap Blocks: exact=1 Buffers: shared hit=14 - Bitmap Index Scan on idx_hm3 (cost=0.00..99.00 rows=10000 width=0) (actual time=0.145..0.145 rows=1 loops=1) Index Cond: (hm3.hmarr % {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}::text[]) Buffers: shared hit=13 Planning time: 0.101 ms Execution time: 0.200 ms (14 rows)

查询海明距离小于等于4的,依旧在毫秒返回。

postgres=# set smlar.type = overlap; 

postgres=# set smlar.threshold = 0; 

postgres=# select 

 smlar( hmarr, {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}) 

 from 

 hm3 

 where 

 hmarr % {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} 

 and length(replace(bitxor(bit0101011111111010000001001011101101100011111101111101101100000011, hmval)::text,0,)) 5 

 limit 100; 

 id | hmval | hmarr | smlar 

----+------------------------------------------------------------------+-------------------------------------------------------------------------------+------- 

 1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} | 4 

(1 row) 

Time: 6.983 ms 

不使用索引,23秒。

postgres=# select * from hm3 where length(replace(bitxor(bit0101011111111010000001001011101101100011111101111101101100000011, hmval)::text,0,)) 

 id | hmval | hmarr 

----+------------------------------------------------------------------+------------------------------------------------------------------------------- 

 1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} 

(1 row) 

Time: 22954.686 ms 

相比没有索引的情况,性能从23秒提升到了0.2毫秒。提升了11.48万倍。

自动切分

有人会说,怎么自动生成simhash切分后的数组呢?

不用担心,PostgreSQL提供了强大的UDF功能,可以建立UDF和TRIGGER,在写入数据时,自动生成切分后的数组。

例子

create table hm4 (id int, hmval bit(64), hmarr text[]); 

create or replace function sp(val bit(64)) returns text[] as $$

select regexp_split_to_array(1_||substring(val::text,1,10)||,2_||substring(val::text,11,10)||,3_||substring(val::text,21,10)||,4_||substring(val::text,31,10)||,5_||substring(val::text,41,10)||,6_||substring(val::text,51,14), ,) ; 

$$ language sql strict;

postgres=# select sp(123::bit(64));

-------------------------------------------------------------------------------------

 {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000001111011}

(1 row)

-- 写入1亿记录

insert into hm4 

select 

 id, 

 val::bit(64), 

 sp(val::bit(64)) 

from 

(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,100000000) t(id)) t; 

-- 索引

create index idx_hm4 on hm4 using gin(hmarr _text_sml_ops ); 

-- 查询海明距离小于等于1的记录,性能杠杠的

postgres=# set smlar.type = overlap; 

postgres=# set smlar.threshold = 5; 

select 

 smlar( hmarr, sp(123::bit(64))) -- 查询与123::bit(64)海明距离小于2的记录

 from 

 hm3 

 where 

 hmarr % sp(123::bit(64)) -- 查询与123::bit(64)海明距离小于2的记录

 and length(replace(bitxor(123::bit(64), hmval)::text,0,)) 2 -- 查询与123::bit(64)海明距离小于2的记录

 limit 100; 

创建触发器,写入simhash时,自动写入切分数组

create or replace function tg() returns trigger as $$

declare

begin

 NEW.hmarr := sp(NEW.hmval);

 return NEW;

$$ language plpgsql strict;

postgres=# create trigger tg before insert or update on hm4 for each row execute procedure tg();

CREATE TRIGGER

-- 效果很赞

postgres=# truncate hm4;

TRUNCATE TABLE

postgres=# insert into hm4 values (1,1::bit(64));

INSERT 0 1

postgres=# select * from hm4;

 id | hmval | hmarr 

----+------------------------------------------------------------------+-------------------------------------------------------------------------------------

 1 | 0000000000000000000000000000000000000000000000000000000000000001 | {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000000000001}

(1 row)

postgres=# update hm4 set hmval=123456::bit(64);

UPDATE 1

postgres=# select * from hm4;

 id | hmval | hmarr 

----+------------------------------------------------------------------+-------------------------------------------------------------------------------------

 1 | 0000000000000000000000000000000000000000000000011110001001000000 | {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000111,6_10001001000000}

(1 row)

爽就点个赞吧。

四、技术点

1、阿里云RDS PostgreSQL smlar插件,使用GIN索引,块级收敛,二重过滤。0.2毫秒的响应速度,1000万数据中,检索海明距离2以内的记录。

2、阿里云RDS PostgreSQL 10,使用多核并行,暴力扫描,1000万数据,检索海明距离为N以内的数据,约450毫秒。

3、阿里云HybridDB for PostgreSQL,使用多机并行,横向扩展计算能力,也可以做到暴力扫描。根据实际的节点数计算查询效率。

五、云端产品

阿里云 RDS PostgreSQL

阿里云 HybridDB for PostgreSQL

六、类似场景、案例

《电商内容去重\内容筛选应用(实时识别转载\盗图\侵权?) - 文本、图片集、商品集、数组相似判定的优化和索引技术》

《基于 阿里云 RDS PostgreSQL 打造实时用户画像推荐系统》

七、小结

采用阿里云RDS PostgreSQL的SMLAR插件,对千万量级的simhash数据检索相似文本,(更多数据量的测试后续提供,响应速度应该在毫秒级),相比没有索引的情况,性能从23秒提升到了0.2毫秒。提升了11.48万倍。

开不开心,意不意外。

八、参考

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

《PostgreSQL结合余弦、线性相关算法 在文本、图片、数组相似 等领域的应用 - 3 rum, smlar应用场景分析》

《PostgreSQL结合余弦、线性相关算法 在文本、图片、数组相似 等领域的应用 - 2 smlar插件详解》

《PostgreSQL结合余弦、线性相关算法 在文本、图片、数组相似 等领域的应用 - 1 文本(关键词)分析理论基础 - TF(Term Frequency 词频)/IDF(Inverse Document Frequency 逆向文本频率)》

《17种文本相似算法与GIN索引 - pg_similarity》

《电商内容去重\内容筛选应用(实时识别转载\盗图\侵权?) - 文本、图片集、商品集、数组相似判定的优化和索引技术》

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

《PostgreSQL (varbit, roaring bitmap) VS pilosa(bitmap库)》

《阿里云RDS for PostgreSQL varbitx插件与实时画像应用场景介绍》

《基于 阿里云 RDS PostgreSQL 打造实时用户画像推荐系统》

《块级(ctid)扫描在IoT(物联网)极限写和消费读并存场景的应用》


一图读懂阿里云RDS架构与选型 阿里云数据库RDS也发布了很多新的特性与能力,包括RDS集群版、Serverless、ARM支持等,另外,之前的版本也缺少了数据库代理,云盘类型等。这里一并进行更新,发布了新的v2版本。
阿里云PolarDB、RDS获评信通院数据库Serverless认证最高“先进级”,AnalyticDB获“增强级” 在日前中国信通院组织的数据库产品能力评测中,阿里云PolarDB for MySQL、RDS MySQL数据库顺利完成了首批事务型数据库Serverless能力分级测试,获最高“先进级”评级;AnalyticDB MySQL和AnalyticDB PostgreSQL顺利完成了首个分析型数据库Serverless能力分级测试,获评“增强级”评级。
阿里云关系型数据库RDS存储类型区别(ESSD云盘、本地SSD盘和SSD云盘) 阿里云关系型数据库RDS(Relational Database Service)是一种稳定可靠、可弹性伸缩的在线数据库服务。云数据库RDS提供三种数据存储类型:ESSD云盘、本地SSD盘和SSD云盘,本文介绍三种存储类型的区别及选购建议。