Journal of South China University of Technology (Natural Science Edition), 2019, Vol. 47, Issue (2): 98-105. doi: 10.12141/j.issn.1000-565X.170550

• Computer Science and Technology •

Improved DPC Clustering Algorithm with Neighbor Density Distribution Optimized Sample Assignment

JI Xia1,2  ZHANG Tao1  ZHU Jianlei1  LIU Shicheng1  LI Xuejun1,2

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, China; 2. Key Laboratory of Intelligent Computing and Signal Processing of the Ministry of Education, Anhui University, Hefei 230039, Anhui, China
  • Received: 2017-12-14 Revised: 2018-06-25 Published: 2019-02-25 Released online: 2019-01-02
  • Contact: JI Xia (born 1982), female, Ph.D., lecturer, mainly engaged in research on data mining, machine learning, and intelligent information processing. E-mail: jixia1983@163.com
  • Supported by:
    the National Natural Science Foundation of China (61602004, 61672034), the Natural Science Foundation of Anhui Province (1708085MF160, 1508085MF127, 1408085MF122), the Key Research and Development Program of Anhui Province (1804d8020309), the Natural Science Foundation of Anhui Higher Education Institutions (KJ2016A041, KJ2017A011), and a funded project of the Collaborative Innovation Center of Information Assurance Technology, Anhui University (ADXXBZ201605)

Abstract: DPC is a recent density-based clustering algorithm that automatically determines the number of clusters and the cluster centers, but its sample assignment strategy makes clustering quality unstable. KNN-DPC, an improved variant of DPC, clusters better, but its low efficiency limits its practical use. To address the shortcomings of both algorithms, this paper proposes a DPC clustering algorithm optimized by neighbor density distribution. The algorithm first uses DPC to search for and identify the initial cluster centers. It then applies two sample assignment strategies, based on the neighbor density distribution of each sample, to assign the remaining samples to their clusters in turn. Theoretical analysis and experiments on classic synthetic datasets and real-world datasets from the UCI Machine Learning Repository show that the proposed algorithm quickly determines the cluster centers of arbitrarily shaped data and assigns samples to clusters effectively. Compared with DPC and KNN-DPC, the proposed algorithm strikes a better balance between clustering quality and running time, is highly stable, and is suitable for adaptive cluster analysis of large-scale datasets.

Key words: DPC algorithm, nearest neighbor, density distribution, clustering
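For orientation, the following is a minimal sketch of the baseline DPC procedure that the abstract builds on, assuming the commonly described formulation: a Gaussian-kernel local density, the distance from each sample to its nearest denser sample, and centers chosen by the largest density-distance product. The function name `dpc_baseline`, the cutoff parameter `d_c`, and the explicit `n_clusters` argument are illustrative assumptions (DPC itself reads the number of centers off a decision graph). The two neighbor-density-based assignment strategies proposed in the paper are not specified in this abstract and are not reproduced here.

```python
import numpy as np


def dpc_baseline(X, d_c, n_clusters):
    """Baseline density peaks clustering (illustrative sketch only).

    X          : (n, d) array of samples
    d_c        : cutoff distance used as the Gaussian kernel width (assumed)
    n_clusters : number of centers to keep (assumed; a simplification)
    """
    n = len(X)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Local density rho_i; subtract 1 to drop each sample's own exp(0) term.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0

    # Sample indices sorted by decreasing density.
    order = np.argsort(-rho, kind="stable")

    delta = np.zeros(n)               # distance to the nearest denser sample
    parent = np.full(n, -1)           # index of that nearest denser sample
    delta[order[0]] = dist[order[0]].max()  # convention for the densest sample
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        parent[i] = j

    # Centers have both high density and large delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf          # the densest sample is always a center
    centers = np.argsort(-gamma)[:n_clusters]

    # Baseline assignment: each remaining sample inherits the label of its
    # nearest denser sample, visited in order of decreasing density.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers
```

Because every non-center sample simply follows its nearest denser neighbor, one early misassignment propagates to all samples that follow it; this single-step chain is the assignment stage that the abstract describes as the source of unstable clustering quality.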

 
