Subtopic Detection Algorithm Based on Hierarchical Clustering

Journal of South China University of Technology (Natural Science Edition), 2019, Vol. 47, Issue (8): 84-95. doi: 10.12141/j.issn.1000-565X.180203

DAI Xiang¹ HUANG Xifeng¹ TANG Rui² JIANG Mengting² CHEN Xingshu²,³ WANG Haizhou²† LUO Liang²

1. China Electronics Technology Group Corporation No. 10 Research Institute, Chengdu 610036, Sichuan, China; 2. College of Cybersecurity, Sichuan University, Chengdu 610065, Sichuan, China; 3. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China

Received: 2018-04-27 Revised: 2019-04-17 Online: 2019-08-25 Published: 2019-08-01
Contact: WANG Haizhou (b. 1986), male, Ph.D., associate professor; research interests: network security, network measurement, data mining, and public opinion monitoring. E-mail: whzh.nc@scu.edu.cn
About author: DAI Xiang (b. 1983), male, senior engineer; research interests: natural language processing and data mining. E-mail: dai.xiang@hotmail.com
Supported by:
the National Science and Technology Support Planning Project of China (2012BAH18B05), the National Natural Science Foundation of China (61272447, 61802271, 81602935), and the Planning Project of Science and Technology Department of Sichuan Province (16ZHSF0483)

Abstract: When latent Dirichlet allocation (LDA) is used for topic detection, three problems arise: the topics produced by the topic model are semantically stratified; LDA modeling yields generic topics with broad meanings; and the topic-number hyperparameter K is usually set from human experience. As a result, the modeling output contains mixed topics that each cover several subtopics. To address these problems, a subtopic detection algorithm based on hierarchical clustering is proposed. It uses document feature word sequences to handle cases in which the LDA classification is too coarse and the generalized hot-topic detection results are of little value for public opinion monitoring. First, the LDA modeling results are optimized by filtering two matrices, namely the topic-word distribution and the document-word distribution. Then overlapping topics are detected and merged, and generic topics and mixed topics are identified with a measure of closeness between documents. Finally, a hierarchical clustering algorithm performs a second clustering of the documents under a topic to obtain the subtopics of that topic. The experimental results show that the detected subtopics reflect the characteristics of hot topics at a deeper level, which facilitates public opinion monitoring and analysis; that, compared with the Single-Pass and K-means algorithms, the proposed algorithm produces more effective results; and that the subtopic detection algorithm based on hierarchical clustering is robust to the strategy used to select K.

Key words: topic model, subtopic, hierarchical clustering, latent Dirichlet allocation, topic detection
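
The last two steps named in the abstract (measuring closeness between the documents of a topic, then re-clustering them hierarchically) can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity, the mean-pairwise `closeness` function, the average-linkage merge rule, the thresholds `0.5` and `min_sim`, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def closeness(docs: list[Counter]) -> float:
    """Mean pairwise similarity of the documents under one topic (assumed measure)."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0


def average_link(docs: list[Counter], c1: list[int], c2: list[int]) -> float:
    """Average-linkage similarity between two clusters of document indices."""
    return sum(cosine(docs[i], docs[j]) for i in c1 for j in c2) / (len(c1) * len(c2))


def subtopics(docs: list[Counter], min_sim: float = 0.3) -> list[list[int]]:
    """Agglomerative clustering: repeatedly merge the two most similar clusters
    until no pair reaches min_sim (a hypothetical stopping threshold)."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        sim, x, y = max(
            (average_link(docs, clusters[x], clusters[y]), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        if sim < min_sim:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so index x stays valid
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy feature-word sequences for four documents assigned to one broad topic.
topic_docs = [
    Counter("earthquake rescue team rescue".split()),
    Counter("earthquake rescue volunteers".split()),
    Counter("earthquake donation fund charity".split()),
    Counter("donation charity fund raised".split()),
]

# Hypothetical rule: a loosely connected topic is treated as generic or mixed.
if closeness(topic_docs) < 0.5:
    print(subtopics(topic_docs))
```

On the toy data the mean pairwise similarity is about 0.33, so the topic is treated as mixed and the script prints `[[0, 1], [2, 3]]`: one rescue-related subtopic and one donation-related subtopic. A similarity threshold is used as the stopping rule here so that the number of subtopics does not have to be fixed in advance.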

CLC Number: TP391.1

DAI Xiang, HUANG Xifeng, TANG Rui, JIANG Mengting, CHEN Xingshu, WANG Haizhou, LUO Liang. Subtopic Detection Algorithm Based on Hierarchical Clustering[J]. Journal of South China University of Technology (Natural Science Edition), 2019, 47(8): 84-95.
