Clustering Algorithm Based on Grid Density and Distance Information Characteristics

Journal of South China University of Technology (Natural Science Edition) ›› 2009, Vol. 37 ›› Issue (4): 18-23,45.

Dai Wei-di¹ Zhang Lu² Wang Wen-jun¹ Hou Yue-xian¹

1. School of Computer Science and Technology, Tianjin University, Tianjin 300072, China; 2. School of Software, Tianjin University, Tianjin 300072, China

Received: 2008-05-12  Revised: 2008-07-04  Online: 2009-04-25  Published: 2009-04-25
Contact: Dai Wei-di (born 1977), male, Ph.D., associate professor, mainly engaged in research on data mining and pattern recognition. E-mail: davidy@126.com
Supported by: National Natural Science Foundation of China (60603027); Tianjin Science and Technology Plan Projects (08ZCKFGX01800, 08ZCKFGX01600)

Abstract

Real data sets usually have uneven density distributions, and the monotonic search used by most grid- and density-based clustering algorithms then struggles to form effective clusters. To address this problem, a clustering algorithm based on grid density and distance information, GDD, is proposed. GDD partitions the data space into grid cells and constructs a transition function based on the distance from the current cluster center. The density transition ratios of grid cells within a local range are then compared with the transition function value computed for the current grid cell to decide whether the cluster should continue to be extended. Experiments with a specific transition function on real and synthetic data sets show that GDD can discover clusters of arbitrary shape, is insensitive to noisy data, and has a time complexity linear in the number of grid cells, which makes it suitable for clustering large-scale real data sets.

Key words: clustering, density, grid, distance, transition function
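
The abstract gives only the outline of GDD, so the following Python sketch is an illustration of the general idea and not the authors' implementation. It bins points into grid cells, starts each cluster at the densest unassigned cell, and extends the cluster to a neighboring cell only when the density ratio between neighbor and current cell reaches a threshold that depends on the distance from the cluster center. The names `gdd_sketch`, `example_transition`, `min_density` and `cell_size`, the form of the transition function, and the definition of the density ratio are all assumptions made here for illustration.

```python
from collections import Counter, deque
from itertools import product
import math


def grid_density(points, cell_size):
    """Count how many points fall into each grid cell (cell = tuple of ints)."""
    return Counter(tuple(math.floor(x / cell_size) for x in p) for p in points)


def neighbors(cell):
    """Yield all cells adjacent to `cell`, including diagonal ones."""
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if any(offset):
            yield tuple(c + o for c, o in zip(cell, offset))


def example_transition(distance):
    """Hypothetical transition function (not from the paper): the minimum
    density ratio a neighbor must reach, growing with the distance from the
    cluster center and capped at 0.9."""
    return min(0.9, 0.3 + 0.1 * distance)


def gdd_sketch(points, cell_size, min_density, transition=example_transition):
    """Map grid cells to cluster ids; cells left unlabeled are treated as noise."""
    density = grid_density(points, cell_size)
    labels = {}
    next_id = 0
    # Visit cells from densest to sparsest so each cluster starts at a density peak.
    for seed in sorted(density, key=lambda c: (-density[c], c)):
        if seed in labels or density[seed] < min_density:
            continue
        labels[seed] = next_id
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            # The threshold depends on how far the current cell is from the seed cell.
            threshold = transition(math.dist(current, seed))
            for nb in neighbors(current):
                if nb in labels or nb not in density:
                    continue
                # Assumed density transition ratio: neighbor density / current density.
                if density[nb] / density[current] >= threshold:
                    labels[nb] = next_id
                    queue.append(nb)
        next_id += 1
    return labels


# Two small dense blobs and one isolated point.
blob = lambda base: [(x + 0.25 * i, y + 0.25 * j)
                     for x in (base, base + 1) for y in (base, base + 1)
                     for i in (1, 2) for j in (1, 2)]
points = blob(0) + blob(10) + [(5.5, 5.5)]
labels = gdd_sketch(points, cell_size=1.0, min_density=2)
print(len(set(labels.values())), (5, 5) in labels)
```

In this sketch the seed cells are ordered with a sort, which is a simplification and is not linear in the number of cells; the expansion step itself visits each occupied cell once and inspects a fixed number of neighbors.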

Dai Wei-di, Zhang Lu, Wang Wen-jun, Hou Yue-xian. Clustering Algorithm Based on Grid Density and Distance Information Characteristics[J]. Journal of South China University of Technology (Natural Science Edition), 2009, 37(4): 18-23,45.
