大叔算法分享(5)聚类算法DBSCAN
一 简介
DBSCAN:Density-based spatial clustering of applications with noise
is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.
二 原理
DBSCAN是一种基于密度的聚类算法,算法过程比较简单,即将相距较近的点(中心点和它的邻居点)聚成一个cluster,然后不断找邻居点的邻居点并加到这个cluster中,直到cluster无法再扩大,然后再处理其他未访问的点;
三 算法伪代码
子方法伪代码
DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minPts).
DBSCAN算法主要有两个参数,一个是距离Eps,一个是最小邻居的数量MinPts,即在中心点半径Eps之内的邻居点数量超过MinPts时,中心点和邻居点才可以组成一个cluster;
四 应用代码实现
python
示例代码
def main_fun(): loc_data = [(40.8379295833, -73.70228875), (40.750613794,-73.993434906), (40.6927066969, -73.8085984165), (40.7489736586, -73.9859616017), (40.8379525833, -73.70209875), (40.6997066969, -73.8085234165), (40.7484436586, -73.9857316017)] epsilon = 10 db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(loc_data)) labels = db.labels_ print(labels) print(db.core_sample_indices_) print(db.components_) n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0) for i in range(0, n_clusters_): print(i) indexs = np.where(labels == i) for j in indexs: print(loc_data[j]) if __name__ == '__main__': main_fun()
主要结果说明
|
详见官方文档:https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
scala
依赖
<dependency>
<groupId>org.scalanlp</groupId>
<artifactId>nak_2.11</artifactId>
<version>1.3</version>
</dependency><dependency>
<groupId>org.scalanlp</groupId>
<artifactId>breeze_2.11</artifactId>
<version>0.13</version>
</dependency>
示例代码
import breeze.linalg.DenseMatrix import nak.cluster.{DBSCAN, GDBSCAN, Kmeans} val matrix = DenseMatrix( (40.8379295833, -73.70228875), (40.6927066969, -73.8085984165), (40.7489736586, -73.9859616017), (40.8379525833, -73.70209875), (40.6997066969, -73.8085234165), (40.7484436586, -73.9857316017), (40.750613794,-73.993434906)) val gdbscan = new GDBSCAN( DBSCAN.getNeighbours(epsilon = 1000.0, distance = Kmeans.euclideanDistance), DBSCAN.isCorePoint(minPoints = 1) ) val clusters = gdbscan cluster matrix clusters.foreach(cluster => { println(cluster.id + ", " + cluster.points.length) cluster.points.foreach(p => p.value.data.foreach(println)) })
详见官方文档:https://github.com/scalanlp/nak
算法细节详见参考
参考:A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
其他:
http://www.cs.fsu.edu/~ackerman/CIS5930/notes/DBSCAN.pdf
https://www.oreilly.com/ideas/clustering-geolocated-data-using-spark-and-dbscan
相关文章
- ☆打卡算法☆LeetCode 221. 最大正方形 算法解析
- 笛卡尔积算法「建议收藏」
- 【算法竞赛】愚蠢的错点
- 《算法竞赛进阶指南》0x14 Hash
- 详解九章算法的作者是谁_arrayset
- 【视频】K近邻KNN算法原理与R语言结合新冠疫情对股票价格预测|数据分享|附代码数据
- 线上分享 | 解决Feed流推荐中的多样性问题,小红书新算法入选KDD 2021
- 数据分享|R语言逻辑回归、Naive Bayes贝叶斯、决策树、随机森林算法预测心脏病|附代码数据
- 对话声网视频算法工程师郑林儒:视频质量评价方法的最优解
- Gradio入门到进阶全网最详细教程一:快速搭建AI算法可视化部署演示(侧重项目搭建和案例分享)
- Redis超时算法有效解决资源管理问题(redis超时算法)
- 只要社会存在偏见,即便是算法操控的机器也无法摘下有色眼镜
- 算法调整,难度再增7.3% 比特币挖矿难度持续“反弹”
- FDA召开公开会议,着重讨论医疗AI的数据多样性及算法偏见问题
- 基于一致性hash算法C++语言的实现详解
- 常用排序算法整理分享(快速排序算法、希尔排序)
- 又一个PHP实现的冒泡排序算法分享