Subtopic Detection Algorithm Based on Hierarchical Clustering

Dai Xiang, Huang Xifeng, Tang Rui, Jiang Mengting, Chen Xingshu, Wang Haizhou, Luo Liang

doi:10.12141/j.issn.1000-565X.180203

Journal of South China University of Technology (Natural Science Edition), 2019, Vol. 47, Issue 8: 84-95

Computer Science and Technology

1. China Electronics Technology Group Corporation No. 10 Research Institute, Chengdu 610036, Sichuan, China; 2. College of Cybersecurity, Sichuan University, Chengdu 610065, Sichuan, China; 3. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China

Dai Xiang (born 1983), male, senior engineer, mainly engaged in research on natural language processing and data mining. E-mail: dai.xiang@hotmail.com

Received date: 2018-04-27

Revised date: 2019-04-17

Online published: 2019-08-01

Supported by

The National Science and Technology Support Planning Project of China (2012BAH18B05), the National Natural Science Foundation of China (61272447, 61802271, 81602935), and the Planning Project of the Science and Technology Department of Sichuan Province (16ZHSF0483)

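Cite this article

Dai Xiang, Huang Xifeng, Tang Rui, Jiang Mengting, Chen Xingshu, Wang Haizhou, Luo Liang. Subtopic Detection Algorithm Based on Hierarchical Clustering[J]. Journal of South China University of Technology (Natural Science Edition), 2019, 47(8): 84-95. DOI: 10.12141/j.issn.1000-565X.180203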

Abstract

When latent Dirichlet allocation (LDA) is used for topic detection, three problems arise: the topics produced by the model are semantically stratified; the model generates generic topics with overly broad meanings; and the topic-number hyperparameter K is usually set from human experience. Together, these lead to mixed topics that contain multiple subtopics in the modeling results. To address these problems, a subtopic detection algorithm based on hierarchical clustering is proposed. It uses a document feature word sequence to detect subtopics in cases where the LDA classification result is too coarse-grained and the generalization of hot topic detection results lowers their value for public opinion monitoring. First, the LDA modeling results are optimized by filtering two matrices, the topic-word distribution and the document-word distribution. Then, overlapping topics are detected and merged, and generic topics and mixed topics are identified using a measure of closeness between documents. Finally, the hierarchical clustering algorithm is applied to re-cluster the documents under each topic, yielding the subtopics of that topic. The experimental results show that the detected subtopics reflect the characteristics of hot topics at a deeper level, which facilitates public opinion monitoring and analysis; that, compared with the Single-Pass algorithm and the K-means algorithm, the proposed algorithm produces more effective results; and that the subtopic detection algorithm based on hierarchical clustering is robust to the strategy used to select K.

Key words: topic model; subtopic; hierarchical clustering; latent Dirichlet allocation; topic detection
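
The final step described in the abstract, re-clustering the documents under one topic to obtain subtopics, can be illustrated with a minimal sketch. The abstract does not state the linkage criterion, the distance measure, or the stopping rule, so average linkage, cosine distance over bag-of-words counts, and a fixed distance threshold are all assumptions made here for illustration. The function name `detect_subtopics`, the `threshold` parameter, and the sample feature-word lists are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import sqrt


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def detect_subtopics(docs, threshold=0.5):
    """Agglomerative (bottom-up) clustering of the documents under one topic.

    docs: list of feature-word lists, one per document (assumed input format).
    threshold: merging stops once the closest pair of clusters is farther
               apart than this value (illustrative, not from the paper).
    Returns a list of clusters, each a sorted list of document indices.
    """
    vectors = [Counter(words) for words in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document

    def average_linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, p, q = min(
            (average_linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are treated as separate subtopics
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


# Hypothetical feature words for four documents assigned to one "earthquake" topic.
docs = [
    ["earthquake", "rescue", "team"],
    ["earthquake", "rescue", "supplies"],
    ["earthquake", "donation", "charity"],
    ["earthquake", "donation", "fund"],
]
print(detect_subtopics(docs, threshold=0.5))  # [[0, 1], [2, 3]]
```

With this toy input, the two rescue documents and the two donation documents each form one subtopic, because the distance within each pair (about 0.33) is below the threshold while the distance between the pairs (about 0.67) is above it. Raising the threshold merges everything into a single cluster, which mirrors the coarse-grained topic that the subtopic step is meant to split.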
