Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (9): 59-67. doi: 10.12141/j.issn.1000-565X.240499

• Computer Science & Technology •

Information Retrieval Re-Ranking Method Based on Bidirectional Text Expansion

TU Xinhui, GUO Cong, ZONG Yuhang   

  1. School of Computer Science, Central China Normal University, Wuhan 430079, Hubei, China
  • Received: 2024-10-09 Published: 2025-09-25 Released: 2025-01-17
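  • Contact: GUO Cong (b. 2001), male, master's student, whose research focuses on natural language processing and information retrieval. E-mail: guo_c@mails.ccnu.edu.cn
  • About author: TU Xinhui (b. 1979), male, Ph.D., associate professor, whose research focuses on natural language processing and information retrieval. E-mail: tuxinhui@ccnu.edu.cn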
  • Supported by:
    the National Natural Science Foundation of China (62472192)

Abstract:

With the rapid development of large language models (LLMs), both text matching and text expansion techniques in information retrieval have advanced considerably. Query expansion and document expansion, two important methods for enriching text representations, are widely used in modern information retrieval systems. Mainstream text expansion methods currently rely on LLMs, but the text these models generate differs markedly from human-written text in linguistic diversity and style. This mismatch may reduce the accuracy of query-document relevance estimation and ultimately degrade the performance of the whole retrieval system. To address this issue, this paper proposes an information retrieval re-ranking method based on bidirectional text expansion (BTE-IRRM). First, zero-shot prompting is used to have an LLM generate pseudo-queries for documents and pseudo-documents for queries. Then, the semantic similarity between the pseudo-queries and the pseudo-documents is computed. Finally, the similarity score of the original query-document pair and the semantic similarity score of the corresponding pseudo-query and pseudo-document are combined by weighted fusion to produce the final document ranking. To validate the method, experiments are conducted on two public datasets (DL19 and DL20). The results show that BTE-IRRM achieves significant improvements over existing baseline methods on multiple evaluation metrics. The proposed bidirectional text expansion therefore further strengthens relevance matching between queries and documents and improves the performance of the whole information retrieval system.

Key words: information retrieval, large language model, query expansion, document expansion
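
The final step described in the abstract, fusing the original query-document score with the pseudo-query/pseudo-document score, can be illustrated with a minimal sketch. The abstract does not give the fusion formula, the similarity function, or the weight, so everything below is an assumption made for illustration: linear interpolation with a hypothetical weight `alpha`, cosine similarity over precomputed embedding vectors, and the names `Candidate` and `rerank`. LLM generation and embedding are left out; the vectors are taken as given.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


@dataclass
class Candidate:
    """One first-stage result (hypothetical structure, not from the paper)."""
    doc_id: str
    original_score: float              # similarity of the original query and document
    pseudo_query_vec: Sequence[float]  # embedding of the pseudo-query generated for this document


def rerank(
    candidates: List[Candidate],
    pseudo_doc_vec: Sequence[float],   # embedding of the pseudo-document generated for the query
    alpha: float = 0.5,                # assumed fusion weight; the paper's value is not stated here
) -> List[Tuple[str, float]]:
    """Return (doc_id, fused_score) pairs sorted by descending fused score."""
    fused = []
    for c in candidates:
        expanded_score = cosine(c.pseudo_query_vec, pseudo_doc_vec)
        # Assumed linear interpolation between the two similarity scores.
        fused.append((c.doc_id, alpha * c.original_score + (1 - alpha) * expanded_score))
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        Candidate("d1", 0.80, [1.0, 0.0]),
        Candidate("d2", 0.70, [0.0, 1.0]),
    ]
    for doc_id, score in rerank(candidates, pseudo_doc_vec=[0.0, 1.0], alpha=0.5):
        print(doc_id, round(score, 3))
```

In this toy input the expanded-text similarity moves `d2` above `d1`, even though `d1` has the higher original score. A real system would likely need to bring both scores onto a comparable scale before fusing them and to tune `alpha` on held-out queries.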

CLC Number: