Journal of South China University of Technology (Natural Science Edition), 2019, Vol. 47, Issue (8): 84-95. doi: 10.12141/j.issn.1000-565X.180203

• Computer Science & Technology •

基于层次聚类的子话题检测算法

代翔1 黄细凤1 唐瑞2 蒋梦婷2 陈兴蜀2,3 王海舟2† 罗梁2   

  1. 中国电子科技集团公司第十研究所,四川 成都 610036; 2. 四川大学 网络空间安全学院,四川 成都 610065; 3. 四川大学 网络空间安全研究院,四川 成都 610065
  • 收稿日期:2018-04-27 修回日期:2019-04-17 出版日期:2019-08-25 发布日期:2019-08-01
  • 通信作者: 王海舟(1986-),男,博士,副教授,主要从事网络安全、网络测量、数据挖掘、舆情监控研究. E-mail:whzh.nc@scu.edu.cn
  • 作者简介:代翔(1983-),男,高级工程师,主要从事自然语言处理、数据挖掘研究. E-mail:dai.xiang@hotmail.com
  • 基金资助:
    国家科技支撑计划项目(2012BAH18B05);国家自然科学基金资助项目(61272447,61802271,81602935);四川省科技厅计划项目(16ZHSF0483)

Subtopic Detection Algorithm Based on Hierarchical Clustering

DAI Xiang1 HUANG Xifeng1 TANG Rui2 JIANG Mengting2 CHEN Xingshu2,3 WANG Haizhou2 LUO Liang2   

  1. China Electronics Technology Group Corporation No. 10 Research Institute, Chengdu 610036, Sichuan, China; 2. College of Cybersecurity, Sichuan University, Chengdu 610065, Sichuan, China; 3. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China
  • Received:2018-04-27 Revised:2019-04-17 Online:2019-08-25 Published:2019-08-01
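  • Contact: WANG Haizhou (born 1986), male, Ph.D., associate professor, whose main research interests are network security, network measurement, data mining, and public opinion monitoring. E-mail: whzh.nc@scu.edu.cn
  • About author: DAI Xiang (born 1983), male, senior engineer, whose main research interests are natural language processing and data mining. E-mail: dai.xiang@hotmail.com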
  • Supported by:
    National Science and Technology Support Planning Project of China (2012BAH18B05); National Natural Science Foundation of China (61272447, 61802271, 81602935); Planning Project of the Science and Technology Department of Sichuan Province (16ZHSF0483)

摘要: 使用隐狄利克雷分布(LDA)进行话题检测时,话题模型产生的话题存在语义上的分层现象;LDA 建模产生的话题会出现语义上概括较广的泛话题;话题数目超参数 K 的设定通常根据人的经验. 这些将造成建模结果出现包含多个子话题的混合话题情况. 针对上述问题,文中基于层次聚类算法,使用一种文档特征词序列对 LDA 模型分类结果粒度过粗、热点话题检测结果泛化所导致的舆情监控价值较低的情况进行子话题检测. 首先对 LDA 模型建模结果进行优化,对话题-单词分布与文档-单词分布两个矩阵进行过滤;然后对重叠话题进行检测与合并,采用文档间紧密度度量方式发现泛话题与混合话题;最后通过层次聚类算法对话题下的文本进行二次聚类,得到话题下的子话题. 实验结果表明:该算法对子话题的检测能够在更深层次上体现出热点话题的特性,便于舆情监控分析;与 Single-Pass 算法和 K-均值聚类算法相比,该算法获得的结果更具有有效性;K 的选取策略对基于层次聚类的子话题检测算法具有鲁棒性.

关键词: 话题模型, 子话题, 层次聚类, 隐狄利克雷分布, 话题检测

Abstract: When latent Dirichlet allocation (LDA) is used for topic detection, the topics it generates exhibit semantic layering, the model produces generic topics with broad meanings, and the number-of-topics hyperparameter K is usually set by human experience. As a result, the modeling output can contain mixed topics that cover several subtopics. To address these problems, a subtopic detection algorithm based on hierarchical clustering is proposed, which uses a document feature word sequence. It targets cases where the LDA classification result is too coarse-grained and the hot topic detection results are overgeneralized, which lowers their value for public opinion monitoring. First, the LDA modeling results are optimized by filtering two matrices, namely the topic-word distribution and the document-word distribution. Then, overlapping topics are detected and merged, and generic topics and mixed topics are identified using a measure of closeness between documents. Finally, hierarchical clustering is applied to the documents under each topic to obtain its subtopics. The experimental results show that the detected subtopics reflect the characteristics of hot topics at a deeper level, which facilitates public opinion monitoring and analysis; that the algorithm yields more effective results than the Single-Pass and K-means algorithms; and that the subtopic detection algorithm is robust to the strategy used to select K.

Key words: topic model, subtopic, hierarchical clustering, latent Dirichlet allocation, topic detection
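
The final step described in the abstract, re-clustering the documents under one topic to obtain subtopics, can be illustrated with a minimal sketch. This is not the authors' implementation: the representation of each document as a set of feature words, the Jaccard distance, the average linkage, and the `max_distance` stopping threshold are all assumptions made for illustration, since the abstract does not specify them.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of feature words (assumed metric)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(c1, c2, docs):
    """Mean pairwise distance between the documents of two clusters."""
    total = sum(jaccard_distance(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def detect_subtopics(docs, max_distance=0.7):
    """Agglomerative clustering of one topic's documents into subtopics.

    docs: list of sets of feature words, one set per document.
    max_distance: hypothetical stopping threshold; merging stops once the
    closest pair of clusters is farther apart than this value.
    Returns a sorted list of clusters, each a sorted list of document indices.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        (a, b), dist = min(
            (((x, y), average_linkage(clusters[x], clusters[y], docs))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Toy documents assigned to one "earthquake" topic (invented for illustration).
topic_docs = [
    {"earthquake", "rescue", "team"},
    {"earthquake", "rescue", "volunteers"},
    {"earthquake", "donation", "fund"},
    {"earthquake", "donation", "charity"},
]
subtopics = detect_subtopics(topic_docs, max_distance=0.6)
print(subtopics)
```

With the toy data, the two rescue documents and the two donation documents each merge at distance 0.5, while the resulting clusters stay apart because their average distance of 0.8 exceeds the threshold. A lower threshold yields finer subtopics, and a higher one collapses them back into the parent topic.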

CLC Number: