Journal of South China University of Technology (Natural Science Edition), 2022, Vol. 50, Issue 6: 37-48, 70. doi: 10.12141/j.issn.1000-565X.210124

Topic collection: Computer Science and Technology, 2022

• Computer Science and Technology •

Product Feature Extraction Method Based on Seed Constraint-LDA

CHEN Kejia, ZHENG Jingjing

  1. School of Economics and Management, Fuzhou University, Fuzhou 350116, Fujian, China
  • Received: 2021-03-10 Revised: 2021-11-25 Issue date: 2022-06-25 Released online: 2021-12-17
  • Corresponding author: CHEN Kejia (born 1978), male, Ph.D., professor; works mainly on text mining and systems engineering. E-mail: kjchen@fzu.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (71701019) and the National Social Science
    Foundation of China (19BTQ072)

Abstract: To extract product features from reviews by category, so that reviews can be displayed separately for each product feature and consumers can make purchasing decisions more efficiently, this paper proposes a product feature extraction method based on SC-LDA (Seed Constraint-Latent Dirichlet Allocation). First, the TF-IDF (Term Frequency–Inverse Document Frequency) algorithm automatically extracts keywords, which serve as the feature seed set. Second, documents are reorganized in two passes, an initial and a secondary reorganization, to address the co-occurrence of multiple feature categories in long texts and the sparsity of short texts, and to raise the document reorganization rate. Third, two kinds of seed constraints, must-link and cannot-link, are used to define a probability expansion and contraction value that influences topic assignment in LDA, making the training results more reasonable. Finally, the topics generated by SC-LDA are mapped to the prior feature categories. The advantages of the proposed method are verified through qualitative analysis of feature categories and feature words, and through quantitative analysis of accuracy, entropy, and purity.
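
The first step of the abstract, TF-IDF keyword extraction for the seed set, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tfidf_seed_words`, the `top_k` cutoff, the rule of ranking each term by its highest score in any document, and the toy reviews are all assumptions made for illustration, and tokenization is assumed to happen upstream.

```python
import math
from collections import Counter


def tfidf_seed_words(docs, top_k=3):
    """Rank terms by their highest TF-IDF score in any document (illustrative rule).

    docs: list of token lists; tokenization is assumed to be done upstream.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    best = {}
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)               # relative term frequency
            idf = math.log(n_docs / df[term])   # plain, unsmoothed IDF
            best[term] = max(best.get(term, 0.0), tf * idf)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Hypothetical toy reviews, already tokenized.
reviews = [
    ["battery", "battery", "lasts", "long"],
    ["screen", "is", "bright", "screen"],
    ["battery", "charges", "fast"],
]
seeds = tfidf_seed_words(reviews, top_k=2)
print(seeds)
```

A term that appears in every document gets an IDF of zero under this unsmoothed formula, so a real pipeline would probably add stop-word filtering and part-of-speech constraints before treating the top-ranked terms as feature seeds.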

Key words: feature extraction, LDA, seed constraint, document reorganization, feature category mapping
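
Two of the quantitative measures named in the abstract, purity and entropy, can be sketched with their standard clustering definitions. The abstract does not give the paper's exact formulas, so the base-2, unnormalized entropy below is an assumption, as are the function name `purity_and_entropy` and the toy labels.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(assigned, gold):
    """Return (purity, entropy) of a topic assignment against gold categories.

    assigned: topic id per item; gold: reference feature category per item.
    """
    clusters = defaultdict(list)
    for topic, label in zip(assigned, gold):
        clusters[topic].append(label)
    n = len(gold)
    purity = 0.0
    entropy = 0.0
    for labels in clusters.values():
        counts = Counter(labels)
        size = len(labels)
        # Purity: share of all items that match their cluster's majority label.
        purity += max(counts.values()) / n
        # Entropy of the label distribution inside this cluster (base 2).
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        # Weight each cluster's entropy by its relative size.
        entropy += (size / n) * h
    return purity, entropy


# Hypothetical example: six feature words, two topics, two gold categories.
assigned = [0, 0, 0, 1, 1, 1]
gold = ["battery", "battery", "screen", "screen", "screen", "screen"]
purity, entropy = purity_and_entropy(assigned, gold)
print(round(purity, 4), round(entropy, 4))
```

Higher purity and lower entropy both indicate that each topic concentrates on a single feature category.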

CLC number: