Computer Science & Technology

Subtopic Detection Algorithm Based on Hierarchical Clustering

  • 1. China Electronics Technology Group Corporation No. 10 Research Institute, Chengdu 610036, Sichuan, China; 2. College of Cybersecurity, Sichuan University, Chengdu 610065, Sichuan, China; 3. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China
DAI Xiang (b. 1983), male, senior engineer, mainly engaged in research on natural language processing and data mining. E-mail: dai.xiang@hotmail.com

Received date: 2018-04-27

  Revised date: 2019-04-17

  Online published: 2019-08-01

Supported by

Supported by the National Science and Technology Support Planning Project of China (2012BAH18B05), the National Natural Science Foundation of China (61272447, 61802271, 81602935), and the Planning Project of Science and Technology Department of Sichuan Province (16ZHSF0483)

Abstract

When latent Dirichlet allocation (LDA) is used for topic detection, the topics generated by the model suffer from semantic differences. The LDA model also generates generic topics with broad meanings, and its parameter K is set from human experience. These problems lead to mixed topics, in which a single modeled topic contains multiple subtopics. To solve these problems, a subtopic detection algorithm is proposed that applies hierarchical clustering to document feature-word sequences. The algorithm addresses two issues: the LDA classification result is too coarse, and overly general hot topic detection results have low value for public opinion monitoring. First, the LDA results are optimized by filtering two matrices, i.e., the topic-word distribution and the document-word distribution. Then, overlapping topics are detected and merged, and generic topics and mixed topics are detected using the density between documents. Finally, hierarchical clustering is used to find the subtopics under each topic. The experimental results show that the subtopics detected by this method reflect the characteristics of hot topics at a deeper level, which facilitates public opinion monitoring and analysis. Compared with the Single-Pass and K-means algorithms, the method obtains more effective results. The subtopic detection algorithm based on hierarchical clustering is also robust to the selection strategy for K.
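The final step described in the abstract, finding subtopics under one topic by hierarchical clustering of document feature-word sequences, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the linkage criterion, the distance measure, or the stopping rule, so average linkage, cosine distance over bag-of-words counts, and the `threshold` parameter are all assumptions made here for illustration, as are the function names and the toy documents.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def find_subtopics(docs, threshold=0.5):
    """Agglomerative clustering of the documents of a single topic.

    docs: list of feature-word sequences (lists of strings).
    threshold: assumed stopping rule; merging stops once the two closest
        clusters are farther apart than this average cosine distance.
    Returns clusters as sorted lists of document indices.
    """
    vectors = [Counter(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        # Average linkage (an assumption): mean pairwise distance.
        total = sum(cosine_distance(vectors[i], vectors[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        i, j = min(pairs,
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical feature-word sequences for one "earthquake" topic.
docs = [
    ["earthquake", "rescue", "team"],
    ["earthquake", "rescue", "victims"],
    ["earthquake", "donation", "fund"],
    ["earthquake", "donation", "charity"],
]
print(find_subtopics(docs, threshold=0.5))  # [[0, 1], [2, 3]]
```

With these toy inputs the rescue documents and the donation documents form two subtopics, because the only word they share across groups is the topic-level term. A lower threshold yields finer subtopics and a higher one merges everything back into the parent topic.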

Cite this article

DAI Xiang, HUANG Xifeng, TANG Rui, JIANG Mengting, CHEN Xingshu, WANG Haizhou, LUO Liang. Subtopic Detection Algorithm Based on Hierarchical Clustering[J]. Journal of South China University of Technology (Natural Science), 2019, 47(8): 84-95. DOI: 10.12141/j.issn.1000-565X.180203
