Getting a Distinct Count in Solr

Posted: 2023-09-11 14:14:06

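Today I ran into a problem: when the data was originally indexed into Solr, no counts were computed, and now I need the total number of distinct values across several tables.
Solr's ISearch has no Distinct feature. My first idea was to use Solr facets as a GROUP BY, but a facet returns only 100 entries by default and my data certainly has more than 100 groups, so I dropped that approach.
Later I found "Solr Count Distinct" online. It describes functionality already shipped in Solr (under Solr Search Requests) that does this kind of thing. The relevant part is quoted here: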

A 100% accurate count of distinct values (count distinct) is not generally possible without actually observing all of the values together. However there are a number of ways to estimate the count.

“unique” Facet Function
The unique facet function is Solr’s fastest implementation to calculate the number of distinct values.
It always provides exact counts on a single Solr node. For distributed search over multiple nodes, it provides exact counts when the number of values per node does not exceed 100 (by default).

When the number of unique values does exceed 100 in any given shard, the following algorithm is used:

It estimates the count by sending the top 100 results from each shard along with the total exact “unique” count for each shard.
totalSeen is the number of actual results we saw from all shards (i.e. not deduped yet).
uniqueSeen is the number of unique values we saw from all shards (i.e. deduped).
notSeen is the number of unique values from each shard that were not sent (because of the 100 cutoff).
factor = uniqueSeen / totalSeen (i.e. what fraction of values that we saw were unique)
estimate = uniqueSeen + ( notSeen * factor ) (i.e. we simply apply the factor to the number of values we didn’t see)
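
The merge step described above can be sketched in a few lines of Python. This is only an illustration of the quoted formula, not Solr's actual implementation; the function name `estimate_unique` and the `(top_values, exact_unique)` input shape are assumptions made for the example.

```python
def estimate_unique(shards, limit=100):
    """Estimate a distinct count from per-shard results.

    `shards` is a list of (top_values, exact_unique) pairs (hypothetical
    shape): the values a shard would send (cut off at `limit`) and that
    shard's exact local distinct count.
    """
    total_seen = 0   # values received from all shards, not deduped
    seen = set()     # values received from all shards, deduped
    not_seen = 0     # unique values the shards did not send (cutoff)
    for top_values, exact_unique in shards:
        sent = list(top_values)[:limit]
        total_seen += len(sent)
        seen.update(sent)
        not_seen += exact_unique - len(sent)
    unique_seen = len(seen)
    if total_seen == 0:
        return 0.0
    factor = unique_seen / total_seen   # fraction of seen values that were unique
    return unique_seen + not_seen * factor
```

When no shard hits the cutoff, `not_seen` is 0 and the result is simply the exact deduplicated count, which matches the statement that counts are exact below the limit.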
Example use:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  x : "unique(manu_exact)"    // manu_exact is the manufacturer indexed as a single string
}'
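
For readers who prefer to send the same request from a script, here is a minimal Python sketch using only the standard library. The helper names `build_unique_request` and `fetch_json` are made up for this example; the URL and field come from the curl command above, and the facet is serialized as strict JSON instead of the relaxed syntax curl sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_unique_request(base_url, field, query="*:*"):
    """Build a POST request asking Solr for unique(field) over `query`."""
    facet = {"x": f"unique({field})"}
    body = urlencode({"q": query, "json.facet": json.dumps(facet)})
    # Supplying `data` makes urllib send a form-encoded POST, like `curl -d`.
    return Request(base_url, data=body.encode("utf-8"))


def fetch_json(request):
    """Send the request and decode the JSON response (needs a running Solr)."""
    with urlopen(request) as resp:
        return json.load(resp)


req = build_unique_request(
    "http://localhost:8983/solr/techproducts/query", "manu_exact"
)
# result = fetch_json(req)  # uncomment when a Solr instance is reachable
```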

For more facet functions, adding facet functions to each facet bucket, or sorting by facet function, see Solr Facet Functions

Aggregation Functions
Faceting involves breaking up the domain into multiple buckets and providing information about each bucket.
There are multiple aggregation functions / statistics that can be used:

Aggregation	Example	Effect
sum	sum(sales)	summation of numeric values
avg	avg(popularity)	average of numeric values
sumsq	sumsq(rent)	sum of squares
min	min(salary)	minimum value
max	max(mul(price,popularity))	maximum value
unique	unique(state)	number of unique values (count distinct)
hll	hll(state)	number of unique values using the HyperLogLog algorithm
percentile	percentile(salary,50,75,99,99.9)	calculates percentiles
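
Several of these aggregations can be requested together in one `json.facet` object. The sketch below only builds the request parameters; the output keys (`exact_states`, `approx_states`, and so on) are arbitrary labels chosen for this example, and the field names are the illustrative ones from the table.

```python
import json

# Each key is a label for the result; each value is an aggregation from the table.
facet = {
    "exact_states": "unique(state)",    # count distinct
    "approx_states": "hll(state)",      # HyperLogLog-based distinct count
    "avg_popularity": "avg(popularity)",
    "salary_percentiles": "percentile(salary,50,75,99,99.9)",
}

# Parameters to send to the /query endpoint, for example as a form-encoded POST body.
params = {"q": "*:*", "json.facet": json.dumps(facet)}
```

Requesting `unique` and `hll` side by side on the same field is one way to see how far the two counts differ on your own data.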

Here is an example I wrote:

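# Quote the URL so the shell does not try to expand the ? and * characters.
# RB040002 is the field whose distinct values are counted.
curl 'http://192.168.1.1:8080/solr/xxshard/query?q=*:*' -d '
    json.facet={
        x:"unique(RB040002)"
    }'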
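
The distinct count comes back under the `facets` key of the JSON response, next to the number of matching documents. A small sketch of reading it in Python follows; the `sample_response` dictionary is a hand-written, hypothetical payload (not real output from my index), and falling back to 0 when the key is absent is a defensive assumption for queries that match no documents.

```python
def distinct_count(response, key="x"):
    """Return the value of facet `key` from a decoded Solr JSON response."""
    facets = response.get("facets", {})
    # Assumption: a missing key means there was nothing to count.
    return facets.get(key, 0)


# Hypothetical response shape, with made-up numbers for illustration only.
sample_response = {
    "responseHeader": {"status": 0},
    "facets": {"count": 1000, "x": 42},
}
distinct = distinct_count(sample_response)
```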

Detailed usage and the other features are covered in the original articles:

http://yonik.com/solr-count-distinct/
http://yonik.com/solr-facet-functions/
