华南理工大学学报(自然科学版) ›› 2009, Vol. 37 ›› Issue (4): 18-23,45.

• 计算机科学与技术 • 上一篇    下一篇

基于网格密度和距离信息特征的聚类算法

戴维迪1  张璐2  王文俊1  侯越先1   

  1. 1. 天津大学 计算机科学与技术学院, 天津 300072; 2. 天津大学 软件学院, 天津 300072
  • 收稿日期:2008-05-12 修回日期:2008-07-04 出版日期:2009-04-25 发布日期:2009-04-25
  • 通信作者: 戴维迪(1977-),男,博士,副教授,主要从事数据挖掘、模式识别研究. E-mail:davidy@126.com
  • 作者简介:戴维迪(1977-),男,博士,副教授,主要从事数据挖掘、模式识别研究.
  • 基金资助:

    国家自然科学基金资助项目(60603027);天津市科技计划项目(08ZCKFGX01800,08ZCKFGX01600)

Clustering Algorithm Based on Grid Density and Distance Information Characteristics

Dai Wei-di1  Zhang Lu2  Wang Wen-jun1  Hou Yue-xian1   

  1. 1. School of Computer Science and Technology, Tianjin University, Tianjin 300072, China; 2. School of Software, Tianjin University, Tianjin 300072, China
  • Received:2008-05-12 Revised:2008-07-04 Online:2009-04-25 Published:2009-04-25
  • Contact: 戴维迪(1977-),男,博士,副教授,主要从事数据挖掘、模式识别研究. E-mail:davidy@126.com
  • About author:戴维迪(1977-),男,博士,副教授,主要从事数据挖掘、模式识别研究.
  • Supported by:

    国家自然科学基金资助项目(60603027);天津市科技计划项目(08ZCKFGX01800,08ZCKFGX01600)

摘要: 真实数据集通常密度分布不均,多数基于网格和密度的聚类算法采用的单调性搜索方法难以形成有效聚类.为此,文中提出了基于网格密度和距离信息特征的聚类算法(GDD).该算法将数据空间划分成网格单元,并构建基于簇中心距离信息的跃迁函数,通过考察局域范围内网格单元的密度跃迁比,并比对计算出的当前网格单元的跃迁函数值,以决定是否继续扩展和增长聚类簇规模.具体的跃迁函数在真实和模拟集上的实验结果表明:GDD算法能够发现任意形状的簇,对噪音数据不敏感,且具有线性于网格数目的时间复杂性,适合对大规模真实数据集的聚类.

关键词: 聚类, 密度, 网格, 距离, 跃迁函数

Abstract:

When disposing of a real data set with skewed data distribution using most grid- and density-based clustering algorithms, effective clustering cannot be obtained due to the monotonic search employed in the algorithms. In order to solve this problem, a new clustering algorithm GDD based on grid density and distance is proposed. In GDD, the data space is divided into many grid cells and a transition function related to the distance from the current clustering center is constructed. Then, the density transition ratios of grid cells in the local area are compared with the computed transition function values of the current grid cell to determine whether the current cluster should be extended. Moreover, by using a transition function, some experiments are made with real and synthetic data sets. The results show that the proposed algorithm which is insensitive to noise data, can discover clusters with arbitrary shape, with a time complexity linear to grid number, and that the algorithm is suitable for the clustering of real large-scale data sets.

Key words: clustering, density, grid, distance, transition function