Journal of South China University of Technology (Natural Science Edition) ›› 2017, Vol. 45 ›› Issue (3): 48-53. doi: 10.3969/j.issn.1000-565X.2017.03.007

• Computer Science and Technology •

Resource Selection Algorithm on the Basis of Topic Model

DONG Shou-bin (董守斌), XIE Yi-fan (谢一帆), YUAN Hua (袁华), CHEN Jian-hao (陈建豪)

  1. School of Computer Science and Engineering // Computation & Computer Network Laboratory of Guangdong Province, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2016-11-27 Publication date: 2017-03-25 Release date: 2017-02-02
  • Contact: YUAN Hua (b. 1969), female, Ph.D., associate professor, whose research focuses on information retrieval. E-mail: hyuan@scut.edu.cn
  • About author: DONG Shou-bin (b. 1967), female, Ph.D., professor, whose research focuses on information retrieval and high-performance computing. E-mail: sbdong@scut.edu.cn
  • Supported by:
    the Major Basic Research Cultivation Project of the Natural Science Foundation of Guangdong Province (2015A030308017) and the Scientific Research Joint Funds of the Ministry of Education of China and China Mobile (MCM20150512)

Abstract: In a federated search environment built on multiple real search engines, resource selection algorithms based on small documents achieve low accuracy because the true number of web pages indexed by each search engine is difficult to estimate. To address this problem, a resource description method based on a topic model is proposed, which uses the LDA topic model to obtain the description words of each resource. On this basis, a new resource selection algorithm is proposed, which combines vertical-domain weights with word vectors to compute the relevance between each resource and the query request, and derives the final resource selection result from the relevance scores. Experimental results show that the proposed topic-model-based algorithm improves resource selection performance and can be effectively applied to federated search over distributed search engines.

Key words: distributed search, resource selection, topic model, vertical domain, word vector
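
The abstract does not give the scoring formula, so the following is only a minimal Python sketch of the general idea: score each resource against a query using its description words, word vectors, and a vertical-domain weight, then keep the top-scoring resources. Everything concrete here is an assumption made for illustration and not the authors' method: the function names (`cosine`, `relevance`, `select_resources`), the toy two-dimensional vectors, the engine names, and the choice to multiply the vertical weight by the average best cosine similarity. The description words are assumed to have been extracted beforehand (for example by an LDA topic model).

```python
import math

# Toy word vectors (hypothetical values, for illustration only).
VECTORS = {
    "goal": (1.0, 0.0),
    "football": (0.9, 0.1),
    "stock": (0.0, 1.0),
    "market": (0.1, 0.9),
}

# Hypothetical resources: description words are assumed to come from a
# topic model; vertical weights are assumed to be given.
RESOURCES = {
    "sports_engine": {"description": ["football"], "vertical_weight": 1.0},
    "finance_engine": {"description": ["stock", "market"], "vertical_weight": 1.0},
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance(query_terms, description_words, vectors, vertical_weight):
    """Assumed score: vertical weight times the mean, over query terms,
    of the best cosine similarity to any description word."""
    known_query = [t for t in query_terms if t in vectors]
    known_desc = [w for w in description_words if w in vectors]
    if not known_query or not known_desc:
        return 0.0
    best = [
        max(cosine(vectors[t], vectors[w]) for w in known_desc)
        for t in known_query
    ]
    return vertical_weight * sum(best) / len(best)


def select_resources(query_terms, resources, vectors, k=1):
    """Return the names of the k resources with the highest assumed score."""
    scored = [
        (relevance(query_terms, r["description"], vectors, r["vertical_weight"]), name)
        for name, r in resources.items()
    ]
    # Sort by descending score, breaking ties by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]
```

With these toy inputs, `select_resources(["goal"], RESOURCES, VECTORS)` selects the sports resource, because "goal" lies much closer to "football" than to "stock" or "market" in the assumed vector space. In practice the vectors would come from a trained word embedding model and the weights from a vertical-domain classification step.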