To classify and extract product features from reviews, so that reviews can be displayed by product feature and consumers can make purchasing decisions more efficiently, this paper proposes a product feature extraction method based on seed-constrained LDA (SC-LDA, Seed Constraint-Latent Dirichlet Allocation). First, the TF-IDF (Term Frequency–Inverse Document Frequency) algorithm automatically extracts keywords, which serve as the feature seed set. Second, the documents are reorganized in two passes, an initial reorganization and a secondary one, to address the co-occurrence of multiple feature categories in long texts and the sparsity of short texts, and to raise the document reorganization rate. Third, two kinds of seed constraints, must-link and cannot-link, are used to define a probability scaling value (expansion or contraction) that influences topic assignment in LDA and makes the training results more reasonable. Finally, the topics generated by SC-LDA are mapped to the prior feature categories. The advantages of the proposed method are verified through qualitative analysis of the feature categories and feature words, and through quantitative analysis of accuracy, entropy, and purity.
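As an illustration of the seed-extraction step, the following is a minimal sketch of TF-IDF keyword ranking in plain Python. It is not the implementation used in this paper: the function name `tfidf_seed_words`, the toy `reviews` corpus, the `top_k` parameter, and the choice to rank each term by its highest score in any single document are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_seed_words(docs, top_k=3):
    """Rank terms by their highest TF-IDF score in any document.

    docs is a list of token lists (hypothetical pre-tokenized reviews).
    Taking the maximum over documents is an assumption of this sketch.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    best = {}
    for doc in docs:
        if not doc:
            continue  # skip empty reviews to avoid division by zero
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)               # relative term frequency
            idf = math.log(n_docs / df[term])   # zero for terms in every document
            best[term] = max(best.get(term, 0.0), tf * idf)
    # Sort by score (descending), then alphabetically for a deterministic order.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Toy corpus, invented for illustration only.
reviews = [
    ["battery", "lasts", "long", "phone"],
    ["screen", "bright", "phone"],
    ["battery", "charges", "fast", "phone"],
]
seeds = tfidf_seed_words(reviews, top_k=3)
print(seeds)
```

In this toy corpus the word "phone" occurs in every review, so its inverse document frequency is zero and it cannot be selected as a seed. Down-weighting such ubiquitous terms is the property that makes TF-IDF a plausible source of feature seeds.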