Journal of South China University of Technology (Natural Science Edition) ›› 2019, Vol. 47 ›› Issue (8): 84-95. doi: 10.12141/j.issn.1000-565X.180203

• Computer Science & Technology •

Subtopic Detection Algorithm Based on Hierarchical Clustering

DAI Xiang¹, HUANG Xifeng¹, TANG Rui², JIANG Mengting², CHEN Xingshu²,³, WANG Haizhou², LUO Liang²

  1. China Electronics Technology Group Corporation No. 10 Research Institute, Chengdu 610036, Sichuan, China; 2. College of Cybersecurity, Sichuan University, Chengdu 610065, Sichuan, China; 3. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China
  • Received: 2018-04-27 Revised: 2019-04-17 Online: 2019-08-25 Published: 2019-08-01
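  • Contact: WANG Haizhou (born 1986), male, Ph.D., associate professor, whose research focuses on network security, network measurement, data mining, and public opinion monitoring. E-mail: whzh.nc@scu.edu.cn
  • About author: DAI Xiang (born 1983), male, senior engineer, whose research focuses on natural language processing and data mining. E-mail: dai.xiang@hotmail.com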
  • Supported by:
    Supported by the National Science and Technology Support Planning Project of China (2012BAH18B05), the National Natural Science Foundation of China (61272447, 61802271, 81602935), and the Planning Project of Science and Technology Department of Sichuan Province (16ZHSF0483)

Abstract: When latent Dirichlet allocation (LDA) is used for topic detection, the topics it generates suffer from semantic differences. The LDA model tends to produce generic topics with broad meanings, and the number of topics K is set manually from experience. As a result, the modeling output can contain mixed topics that span multiple subtopics. To address these problems, a subtopic detection algorithm is proposed that applies hierarchical clustering to document feature word sequences. The algorithm targets two issues: the LDA classification results are too coarse, and overly general hot-topic detection results are of low value for public opinion monitoring. First, the LDA results are optimized by filtering two matrices, i.e., the topic-word distribution and the document-word distribution. Then, overlapping topics are detected and merged, and generic topics and mixed topics are identified using the density between documents. Finally, hierarchical clustering is used to find the subtopics under each topic. The experimental results show that the subtopics detected by this method reflect the characteristics of hot topics at a deeper level, which facilitates public opinion monitoring and analysis. Compared with the Single-Pass and K-means algorithms, the proposed method obtains more effective results. In addition, the subtopic detection algorithm based on hierarchical clustering is robust to the strategy used to select K.
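
The final step of the abstract, finding subtopics under one topic by hierarchical clustering of documents, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the similarity measure, linkage rule, or stopping criterion, so Jaccard similarity over sets of feature words, average linkage, and the `threshold` parameter are all assumptions made for illustration, as are the function names and the toy documents.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of feature words (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def average_linkage(c1, c2, docs):
    """Mean pairwise similarity between the documents of two clusters."""
    total = sum(jaccard(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def detect_subtopics(docs, threshold=0.3):
    """Bottom-up (agglomerative) clustering of documents.

    Starts with one cluster per document and repeatedly merges the most
    similar pair of clusters until the best similarity drops below
    `threshold` (a hypothetical stopping rule). Returns a list of clusters,
    each a sorted list of document indices.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_sim, (a, b) = max(
            (average_linkage(clusters[a], clusters[b], docs), (a, b))
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break
        # b > a always holds, so deleting b does not shift index a.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy documents under a single hypothetical "earthquake" topic,
# each reduced to a set of feature words.
docs = [
    {"earthquake", "rescue", "victims"},
    {"earthquake", "rescue", "teams"},
    {"earthquake", "donation", "charity"},
    {"donation", "charity", "funds"},
]
subtopics = detect_subtopics(docs, threshold=0.3)
print(subtopics)
```

With these toy inputs the documents about rescue and the documents about donations end up in separate clusters. Lowering `threshold` merges them back into a single topic, which shows how the stopping rule controls subtopic granularity in this sketch.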

Key words: topic model, subtopic, hierarchical clustering, latent Dirichlet allocation, topic detection

CLC Number: