

Unpaired Cross-Modal Retrieval Re-Ranking Based on Neighbor Information Aggregation

  • WO Yan ,
  • LIANG Zhanyang
  • School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
WO Yan (b. 1975), female, Ph.D., professor, whose research focuses on multimedia application technology. E-mail: woyan@scut.edu.cn

Received date: 2024-12-25

  Online published: 2025-06-03

Supported by

the Natural Science Foundation of Guangdong Province(2025A1515011905)


Cite this article

WO Yan, LIANG Zhanyang. Unpaired cross-modal retrieval re-ranking based on neighbor information aggregation[J]. Journal of South China University of Technology (Natural Science Edition), 2025, 53(11): 18-26. DOI: 10.12141/j.issn.1000-565X.240598

Abstract

As a post-processing technique, re-ranking has demonstrated significant effectiveness in cross-modal retrieval tasks. By mining and processing the information between initial ranking lists, the re-ranking process effectively improves retrieval accuracy. Current mainstream cross-modal retrieval re-ranking methods re-rank the initial list on paired datasets. However, they have poor flexibility: they cannot be plugged into an existing system without modifying the original framework and retraining, which makes it difficult to transfer them to other frameworks. Moreover, they cannot be applied in unpaired scenarios. Cross-modal retrieval has achieved significant progress by relying on large-scale paired datasets, but this reliance overlooks the fact that annotating such datasets in practical scenarios requires substantial resources. To address these issues, this paper proposes an unpaired cross-modal retrieval re-ranking method based on neighbor information aggregation. The method improves retrieval performance by mining and exploiting the neighbor information of samples, pushing incorrect answers away from the query input. It searches for local neighbors in the Euclidean neighborhood and for global neighbors through collaborative representation, then fuses these two types of neighbor information to generate new features, with which the semantic similarity to the retrieval input is recomputed, completing the re-ranking process. Finally, the proposed method is applied as a post-processing technique in several cross-modal retrieval frameworks and tested on the MSCOCO dataset; the results demonstrate its effectiveness and its superiority over other re-ranking methods.
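The pipeline described in the abstract (local neighbors found by Euclidean distance, global neighbors expressed through collaborative representation, fusion of the two into new features, and re-scoring against the query) can be sketched as below. This is an illustrative reconstruction, not the paper's implementation: the function name, the ridge-regularized form of collaborative representation, and the hyperparameters `k`, `lam`, and `alpha` are all assumptions.

```python
import numpy as np

def rerank_by_neighbor_aggregation(query, gallery, k=5, lam=0.1, alpha=0.5):
    """Re-rank a gallery of feature vectors against a query feature by
    aggregating neighbor information.

    Illustrative sketch only: k (neighborhood size), lam (ridge
    regularizer), and alpha (fusion weight) are assumed hyperparameters.
    """
    n = len(gallery)

    # Local neighbors: the k nearest gallery samples in Euclidean distance
    # (each sample's own row distance is 0, so it is included).
    dists = np.linalg.norm(gallery[:, None, :] - gallery[None, :, :], axis=-1)
    local_feats = np.empty_like(gallery)
    for i in range(n):
        idx = np.argsort(dists[i])[:k]
        local_feats[i] = gallery[idx].mean(axis=0)

    # Global neighbors: reconstruct each sample from all other samples via a
    # ridge-regularized collaborative representation (closed-form solution).
    global_feats = np.empty_like(gallery)
    for i in range(n):
        D = np.delete(gallery, i, axis=0).T          # dictionary without sample i
        w = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]),
                            D.T @ gallery[i])
        global_feats[i] = D @ w

    # Fuse the two neighbor views into a new feature, then recompute
    # cosine similarity with the query and re-rank.
    fused = alpha * local_feats + (1 - alpha) * global_feats
    fused /= np.linalg.norm(fused, axis=1, keepdims=True) + 1e-12
    q = query / (np.linalg.norm(query) + 1e-12)
    sims = fused @ q
    return np.argsort(-sims)                         # new ranking, best first
```

In this sketch the fused features move samples toward their neighborhood consensus, so a gallery item whose neighbors disagree with the query drifts away from it in the re-ranked list, matching the abstract's intuition of pushing incorrect answers away from the query input.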
