Improved DPC Clustering Algorithm with Neighbor Density Distribution Optimized Sample Assignment

doi:10.12141/j.issn.1000-565X.170550

Journal of South China University of Technology (Natural Science Edition), 2019, Vol. 47, Issue 2: 98-105. doi: 10.12141/j.issn.1000-565X.170550

JI Xia^{1, 2} ZHANG Tao¹ ZHU Jianlei¹ LIU Shicheng¹ LI Xuejun^{1, 2}

1. School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, China; 2. Key Laboratory of Intelligent Computing and Signal Processing of the Ministry of Education, Anhui University, Hefei 230039, Anhui, China

Received: 2017-12-14 Revised: 2018-06-25 Online: 2019-02-25 Published: 2019-01-02
Contact: JI Xia (b. 1982), female, Ph.D., lecturer, whose research focuses on data mining, machine learning, and intelligent information processing. E-mail: jixia1983@163.com
Supported by:
the National Natural Science Foundation of China (61602004, 61672034); the Natural Science Foundation of Anhui Province (1708085MF160, 1508085MF127, 1408085MF122); the Key Research and Development Program of Anhui Province (1804d8020309); the Natural Science Foundation of Anhui Higher Education Institutions (KJ2016A041, KJ2017A011); and the Information Assurance Technology Collaborative Innovation Center of Anhui University (ADXXBZ201605)

Abstract: The DPC (density peaks clustering) algorithm is a new density-based clustering algorithm that can automatically determine the number of clusters and the cluster centers. However, its sample assignment strategy makes clustering quality unstable. KNN-DPC, an improved variant of DPC, achieves better clustering results, but its low efficiency limits its practical use. To address both shortcomings, this paper proposes a DPC clustering algorithm optimized by neighbor density distribution. The algorithm first uses DPC to search for and identify the initial cluster centers. It then applies two sample assignment strategies, selected according to the density distribution of each sample's neighbors, to assign the remaining samples to their clusters in turn. Theoretical analysis and experiments on classic synthetic datasets and on real-world datasets from the UCI Machine Learning Repository show that the proposed algorithm can quickly determine the cluster centers of arbitrarily shaped data and assign samples to clusters effectively. Compared with DPC and KNN-DPC, the proposed algorithm strikes a better balance between clustering quality and running time, is highly stable, and is suitable for adaptive clustering analysis of large-scale datasets.
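For orientation, the baseline that the abstract starts from can be sketched in a few lines. The block below is a minimal illustration of standard DPC only: it computes a local density and a distance to the nearest denser point for each sample, takes the samples with the largest product of the two as cluster centers, and gives every remaining sample the label of its nearest denser neighbor. It does not implement the two neighbor-density-based assignment strategies proposed in the paper, which the abstract does not specify. The function name `dpc_baseline`, the Gaussian density kernel, the cutoff parameter `d_c`, and passing `n_clusters` explicitly (instead of reading centers off a decision graph) are assumptions made for this sketch.

```python
import numpy as np


def dpc_baseline(X, d_c, n_clusters):
    """Sketch of baseline DPC (not the paper's improved assignment step)."""
    n = X.shape[0]
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Local density: Gaussian kernel over all other samples (assumed kernel).
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0  # drop the self term

    # Visit samples from densest to sparsest; stable sort breaks ties by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    nearest_denser = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest sample: by convention, delta is its largest distance.
            delta[i] = dist[i].max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        nearest_denser[i] = j

    # Cluster centers: samples with the largest gamma = rho * delta.
    gamma = rho * delta
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]

    # Assignment: each non-center inherits the label of its nearest denser
    # sample, which has already been labeled because of the visiting order.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, centers


if __name__ == "__main__":
    # Two 3x3 grids far apart; each grid should become one cluster.
    grid = np.array([[x, y] for x in (-1, 0, 1) for y in (-1, 0, 1)], dtype=float)
    X = np.vstack([grid, grid + 10.0])
    labels, centers = dpc_baseline(X, d_c=1.0, n_clusters=2)
    print(labels, centers)
```

The single-pass assignment at the end is the step the abstract identifies as the source of unstable clustering quality: one wrongly labeled dense sample passes its label on to every sample that chains to it.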

Key words: DPC algorithm, neighbor, density distribution, clustering

CLC number: TP301.6

Cite this article: JI Xia, ZHANG Tao, ZHU Jianlei, LIU Shicheng, LI Xuejun. Improved DPC Clustering Algorithm with Neighbor Density Distribution Optimized Sample Assignment [J]. Journal of South China University of Technology (Natural Science Edition), 2019, 47(2): 98-105.

