Journal of South China University of Technology (Natural Science Edition) ›› 2008, Vol. 36 ›› Issue (5): 30-37.

• Computer Science & Technology •

High-Efficiency Text Clustering Algorithm Based on Semantic Distance

Feng Shao-rong  Xiao Wen-jun    

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2007-06-27 Revised: 2007-09-03 Online: 2008-05-25 Published: 2008-05-25
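  • Contact: Feng Shao-rong (b. 1964), male, part-time doctoral candidate, associate professor at Xiamen University, mainly engaged in research on parallel and distributed databases, data warehouses, and data mining. E-mail: shaorong@xmu.edu.cn
  • About author: Feng Shao-rong (b. 1964), male, part-time doctoral candidate, associate professor at Xiamen University, mainly engaged in research on parallel and distributed databases, data warehouses, and data mining.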
  • Supported by:

    National Natural Science Foundation of China (50474033)

Abstract:

Existing text clustering algorithms overlook the semantic relations between words and therefore compute text similarity with low accuracy. To address this, the paper proposes a new text clustering algorithm based on semantic distance. In this method, each text is analyzed semantically, and its specific semantics are used to calculate similarity. The nearest neighbor clustering algorithm is adopted, and a second clustering algorithm is presented to overcome its sensitivity to the input order of the texts. Feature words representing each cluster are then chosen according to their similarity weights, so that the remaining feature words are close to the theme of the cluster. Experimental results indicate that the proposed algorithm achieves higher clustering precision and recall than the k-Means algorithm based on the vector space model.

Key words: text clustering, semantic distance, similarity, nearest neighbor clustering, clustering algorithm
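
The two-stage idea summarized in the abstract (nearest neighbor clustering followed by a second pass that reduces dependence on input order) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the semantic distance or the second clustering algorithm, so the Jaccard similarity on word sets, the `threshold` parameter, and the reassignment rule in `second_pass` are all assumptions made only for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence

Doc = FrozenSet[str]


def jaccard(a: Doc, b: Doc) -> float:
    """Placeholder similarity (assumption); the paper uses a semantic distance."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_neighbor_clustering(
    docs: Sequence[Doc], sim: Callable[[Doc, Doc], float], threshold: float
) -> List[int]:
    """Single pass: each text joins the cluster of its most similar earlier
    text if that similarity reaches the threshold, else starts a new cluster."""
    labels: List[int] = []
    n_clusters = 0
    for i, doc in enumerate(docs):
        best_sim, best_j = -1.0, -1
        for j in range(i):
            s = sim(doc, docs[j])
            if s > best_sim:
                best_sim, best_j = s, j
        if best_j >= 0 and best_sim >= threshold:
            labels.append(labels[best_j])
        else:
            labels.append(n_clusters)
            n_clusters += 1
    return labels


def second_pass(
    docs: Sequence[Doc],
    labels: Sequence[int],
    sim: Callable[[Doc, Doc], float],
    threshold: float,
) -> List[int]:
    """Hypothetical refinement (assumption): move each text to the cluster
    whose other members are most similar to it on average, provided that
    average reaches the threshold; otherwise keep its first-pass label."""
    new_labels = list(labels)
    for i, doc in enumerate(docs):
        best_label, best_score = labels[i], -1.0
        for c in sorted(set(labels)):
            members = [j for j, lab in enumerate(labels) if lab == c and j != i]
            if not members:
                continue
            score = sum(sim(doc, docs[j]) for j in members) / len(members)
            if score >= threshold and score > best_score:
                best_label, best_score = c, score
        new_labels[i] = best_label
    return new_labels
```

Because the first pass compares each text only with texts seen earlier, its result depends on input order; the second pass looks at all first-pass clusters at once, which is one simple way to soften that dependence. Swapping `jaccard` for a semantic-distance-based similarity would bring the sketch closer to the approach the abstract describes.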