Journal of South China University of Technology (Natural Science Edition) ›› 2008, Vol. 36 ›› Issue (9): 37-42.

• Computer Science & Technology •

Topic-Based Document Retrieval Model

Jia Xi-ping  Peng Hong  Zheng Qi-lun  Shi Shi-xu  Jiang Zhuo-lin     

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
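  • Received: 2008-01-11 Revised: 2008-04-02 Online: 2008-09-25 Published: 2008-09-25
  • Contact: Jia Xi-ping (born 1976), male, Ph.D. candidate, mainly engaged in research on natural language processing and data mining. E-mail: jiaxp@126.com
  • About author: Jia Xi-ping (born 1976), male, Ph.D. candidate, mainly engaged in research on natural language processing and data mining.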
  • Supported by:

    Natural Science Foundation of Guangdong Province (07006474); Science and Technology Research Project of Guangdong Province (2007B010200044)

Abstract:

As most existing document retrieval models learn semantics inefficiently and cannot capture document similarity at the topic level, a topic-based document retrieval model (TDRM) is proposed. TDRM provides a common topic space for all documents, represents each document as a vector in that space, defines document similarity as the cosine of the angle between document vectors, and uses Latent Dirichlet Allocation to learn the topic distribution of each document. Experimental results show that, compared with a document similarity model based on TextTiling and optimal matching of a bipartite graph, TDRM achieves higher average precision and recall in retrieving similar documents, with the harmonic mean of its average precision and recall being 44% greater than that of the reference model.
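
The similarity measure and the evaluation score described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the topic distributions below are hypothetical values standing in for what Latent Dirichlet Allocation would learn, and the names `cosine_similarity`, `rank_by_similarity`, and `harmonic_mean` are assumptions introduced only for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors in the same topic space."""
    if len(u) != len(v):
        raise ValueError("vectors must share the same topic space")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def harmonic_mean(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-document topic distributions over three topics.
# In TDRM these would come from LDA; here they are made up for illustration.
doc_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.1, 0.1, 0.8],
    "doc_c": [0.6, 0.3, 0.1],
}
query_topics = [0.65, 0.25, 0.10]  # hypothetical query document

ranking = rank_by_similarity(query_topics, doc_topics)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.4f}")
```

Because every document lives in the same topic space, any two documents can be compared directly, regardless of their length or vocabulary overlap. In this toy data, doc_a and doc_c share the query's dominant topic and rank above doc_b.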

Key words: topic, document similarity, document retrieval, information retrieval, data mining