Computer Science & Technology

Unpaired Cross-Modal Retrieval Re-Ranking Based on Neighbor Information Aggregation

  • WO Yan,
  • LIANG Zhanyang
  • School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
WO Yan (1975—), female, PhD, professor, whose research focuses on multimedia application technology. E-mail: woyan@scut.edu.cn

Received date: 2024-12-25

Online published: 2025-06-03

Supported by

the Natural Science Foundation of Guangdong Province(2025A1515011905)

Abstract

As a post-processing technique, re-ranking has proven highly effective in cross-modal retrieval tasks: by mining and exploiting the information contained in the initial ranking list, it improves retrieval accuracy. However, current mainstream cross-modal re-ranking methods operate on paired datasets. They offer poor flexibility, since they cannot be plugged into an existing system without modifying the original framework and retraining, which makes them difficult to transfer to other frameworks, and they cannot be applied in unpaired scenarios. Moreover, although cross-modal retrieval has made significant progress by relying on large-scale paired datasets, annotating such datasets in practical settings requires substantial resources. To address these issues, this paper proposes an unpaired cross-modal retrieval re-ranking method based on neighbor information aggregation. The method improves retrieval performance by mining and exploiting the neighbor information of samples, pushing incorrect answers away from the query. It searches for local neighbors within a Euclidean neighborhood, obtains global neighbor expressions through collaborative representation, and then aggregates the two types of neighbor information into new features that are used to re-compute semantic similarity with the retrieval input, completing one round of re-ranking. Finally, the proposed method is applied as a post-processing step to several cross-modal retrieval frameworks and evaluated on the MSCOCO dataset, demonstrating its effectiveness and superiority over other re-ranking methods.
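The pipeline described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes cosine similarity for re-scoring, mean-pooling of the local Euclidean neighbors, a ridge-regularized least-squares solution for the collaborative representation, and illustrative hyperparameters `k`, `lam`, and `alpha` that do not come from the paper.

```python
import numpy as np

def neighbor_rerank(query, gallery, k=5, lam=0.1, alpha=0.5):
    """Re-rank a gallery for one query by aggregating neighbor information.

    query:   (d,) query feature
    gallery: (n, d) candidate features
    Returns gallery indices sorted from best to worst match.
    """
    # Local neighbors: the k nearest gallery items in Euclidean distance,
    # summarized here by simple mean pooling (an assumption of this sketch).
    dists = np.linalg.norm(gallery - query, axis=1)
    local_feat = gallery[np.argsort(dists)[:k]].mean(axis=0)

    # Global neighbor expression via collaborative representation:
    # solve min_w ||query - G w||^2 + lam ||w||^2 over the whole gallery,
    # then use the reconstruction G w as a globally weighted expression.
    G = gallery.T                                  # (d, n)
    A = G.T @ G + lam * np.eye(G.shape[1])         # (n, n), always invertible
    w = np.linalg.solve(A, G.T @ query)            # collaborative weights
    global_feat = G @ w

    # Aggregate both neighbor expressions into a new query feature and
    # re-score the gallery by cosine similarity.
    new_query = alpha * local_feat + (1 - alpha) * global_feat
    new_query /= np.linalg.norm(new_query) + 1e-12
    g_norm = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-12)
    sims = g_norm @ new_query
    return np.argsort(-sims)
```

Because the new query feature is built only from the gallery itself, the procedure needs no image-text pairs and no retraining, which is what lets it act as a drop-in post-processing step on top of an existing retrieval model.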

Cite this article

WO Yan, LIANG Zhanyang. Unpaired Cross-Modal Retrieval Re-Ranking Based on Neighbor Information Aggregation[J]. Journal of South China University of Technology (Natural Science), 2025, 53(11): 18-26. DOI: 10.12141/j.issn.1000-565X.240598
