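Getting a Distinct Count in Solr
Today I ran into a problem: when the data was originally indexed into Solr, no counts were computed, and now I need the total number of distinct values across several tables.
Solr's ISearch has no Distinct feature. One idea was to use Solr's Facet grouping as a GroupBy, but Facet returned only 100 entries (its default limit) and the data certainly has more than 100 groups, so I ruled that approach out.
Later I found "Solr Count Distinct" online. It is a feature Solr has already released (see Solr Search Requests) that provides this kind of functionality. The relevant part of the original article is quoted below: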
A 100% accurate count of distinct values (count distinct) is not generally possible without actually observing all of the values together. However there are a number of ways to estimate the count.
“unique” Facet Function
The unique facet function is Solr’s fastest implementation to calculate the number of distinct values.
It always provides exact counts on a single Solr node. For distributed search over multiple nodes, it provides exact counts when the number of values per node does not exceed 100 (by default). When the number of unique values does exceed 100 in any given shard, the following algorithm is used:
It estimates the count by sending the top 100 results from each shard along with the total exact “unique” count for each shard.
totalSeen is the number of actual results we saw from all shards (i.e. not deduped yet).
uniqueSeen is the number of unique values we saw from all shards (i.e. deduped).
notSeen is the number of unique values from each shard that were not sent (because of the 100 cutoff).
factor = uniqueSeen / totalSeen (i.e. what fraction of values that we saw were unique)
estimate = uniqueSeen + ( notSeen * factor ) (i.e. we simply apply the factor to the number of values we didn’t see)
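The estimation steps above can be sketched in Python to make the arithmetic concrete. This is only an illustration of the quoted formula, not Solr's actual code: the function name `estimate_unique` and the `cutoff` parameter are hypothetical, and because the text does not say which values count as a shard's "top 100", the sketch simply takes the first `cutoff` values in sorted order.

```python
from typing import Hashable, Iterable, List


def estimate_unique(shards: List[Iterable[Hashable]], cutoff: int = 100) -> float:
    """Estimate the distinct count across shards using the quoted formula.

    `shards` holds the values stored on each shard. Hypothetical helper,
    not part of Solr.
    """
    total_seen = 0   # values actually sent by all shards, not deduped
    seen = set()     # union of the values sent by all shards
    not_seen = 0     # per-shard unique values that were cut off

    for values in shards:
        distinct = set(values)
        # Assumption: "top" is unspecified, so take the first `cutoff` sorted values.
        sent = sorted(distinct)[:cutoff]
        total_seen += len(sent)
        seen.update(sent)
        not_seen += len(distinct) - len(sent)

    unique_seen = len(seen)
    if total_seen == 0:
        return 0.0
    factor = unique_seen / total_seen        # fraction of seen values that were unique
    return unique_seen + not_seen * factor   # apply the factor to the unseen values


if __name__ == "__main__":
    shard_a = ["a", "b", "c"]
    shard_b = ["b", "c", "d"]
    print(estimate_unique([shard_a, shard_b]))            # no shard exceeds the cutoff
    print(estimate_unique([shard_a, shard_b], cutoff=2))  # both shards exceed the cutoff
```

When no shard exceeds the cutoff, `notSeen` is zero and the result is exactly the size of the union, which matches the statement that counts are exact in that case.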
Example use:
$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
x : "unique(manu_exact)" // manu_exact is the manufacturer indexed as a single string
}'
For more facet functions, adding facet functions to each facet bucket, or sorting by facet function, see Solr Facet Functions
Aggregation Functions
Faceting involves breaking up the domain into multiple buckets and providing information about each bucket.
There are multiple aggregation functions / statistics that can be used:
Aggregation | Example | Effect |
---|---|---|
sum | sum(sales) | summation of numeric values |
avg | avg(popularity) | average of numeric values |
sumsq | sumsq(rent) | sum of squares |
min | min(salary) | minimum value |
max | max(mul(price,popularity)) | maximum value |
unique | unique(state) | number of unique values (count distinct) |
hll | hll(state) | number of unique values using the HyperLogLog algorithm |
percentile | percentile(salary,50,75,99,99.9) | calculates percentiles |
Below is an example I wrote:
curl 'http://192.168.1.1:8080/solr/xxshard/query?q=*:*' -d '
json.facet={
x:"unique(RB040002)"
}'
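The same request can also be issued from a script. The following is a minimal Python sketch using only the standard library. The helper names (`build_unique_request`, `extract_unique`, `count_distinct`) are hypothetical, `rows=0` is an addition not present in the curl example (it only suppresses the document list), and the sketch assumes the JSON response carries the result under `facets` with the same key (`x`) used in `json.facet`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_unique_request(base_url: str, field: str, q: str = "*:*") -> Request:
    """Build a POST request asking for unique(<field>) via json.facet."""
    facet = {"x": f"unique({field})"}
    body = urlencode({
        "q": q,
        "rows": 0,                        # assumption: only the count is needed
        "json.facet": json.dumps(facet),
    }).encode("utf-8")
    return Request(base_url, data=body, method="POST")


def extract_unique(response: dict) -> int:
    """Pull the distinct count out of a parsed response body."""
    # Assumption: the facet result is returned under "facets" with key "x".
    return response["facets"]["x"]


def count_distinct(base_url: str, field: str) -> int:
    """Send the request to a running Solr instance and return the count."""
    with urlopen(build_unique_request(base_url, field)) as resp:
        return extract_unique(json.load(resp))


if __name__ == "__main__":
    # Host, port, and core name are taken from the curl example above.
    url = "http://192.168.1.1:8080/solr/xxshard/query"
    print(count_distinct(url, "RB040002"))
```

Sending the parameters as a form-encoded body mirrors what `curl -d` does, and building the facet with `json.dumps` avoids quoting mistakes in hand-written JSON.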
Detailed usage and other features are covered in the original articles below:
http://yonik.com/solr-count-distinct/
http://yonik.com/solr-facet-functions/