WO Yan, LIANG Zhanyang. Unpaired Cross-Modal Retrieval Re-Ranking Based on Neighbor Information Aggregation[J]. Journal of South China University of Technology(Natural Science Edition), 2025, 53(11): 18-26.
[1] WANG L, LI Y, LAZEBNIK S. Learning deep structure-preserving image-text embeddings[C]∥Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 5005-5013.
[2] ZHONG Z, ZHENG L, CAO D, et al. Re-ranking person re-identification with k-reciprocal encoding[C]∥Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1318-1327.
[3] SHAO S, CHEN K, KARPUR A, et al. Global features are all you need for image retrieval and reranking[C]∥Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 11036-11046.
[4] WEI W, JIANG M, ZHANG X, et al. Boosting cross-modal retrieval with MVSE++ and reciprocal neighbors[J]. IEEE Access, 2020, 8: 84642-84651.
[5] GRAVES A. Long short-term memory[M]∥GRAVES A. Supervised sequence labelling with recurrent neural networks. Berlin: Springer, 2012: 37-45.
[6] CHO K, VAN MERRIENBOER B, BAHDANAU D, et al. On the properties of neural machine translation: encoder-decoder approaches[EB/OL]. (2014-10-07)[2024-10-30].
[7] HAN C, ZHOU D, XIE Y, et al. Collaborative representation with curriculum classifier boosting for unsupervised domain adaptation[J]. Pattern Recognition, 2021, 113: 107802/1-9.
[8] TANG S, ZOU Y, SONG Z, et al. Semantic consistency learning on manifold for source data-free unsupervised domain adaptation[J]. Neural Networks, 2022, 152: 467-478.
[9] KODINARIYA T M, MAKWANA P R. Review on determining number of Cluster in K-Means clustering[J]. International Journal of Advance Research in Computer Science and Management Studies, 2013, 1(6): 90-95.
[10] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]∥Proceedings of the 13th European Conference on Computer Vision. Zurich: Springer International Publishing, 2014: 740-755.
[11] KARPATHY A, JOULIN A, LI F. Deep fragment embeddings for bidirectional image sentence mapping[J]. Advances in Neural Information Processing Systems, 2014, 27: 5281/1-9.
[12] HUANG Y, WANG Y, ZENG Y, et al. MACK: multimodal aligned conceptual knowledge for unpaired image-text matching[J]. Advances in Neural Information Processing Systems, 2022, 35: 7892-7904.
[13] FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: improving visual-semantic embeddings with hard negatives[EB/OL]. (2018-07-29)[2024-10-30].
[14] LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]∥Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 201-216.
[15] ZHANG K, MAO Z, WANG Q, et al. Negative-aware attention framework for image-text matching[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 15661-15670.
[16] LIANG Z, WO Y. From coarse to fine: a two-stage common semantic space construction for unpaired cross modal retrieval[J]. Multimedia Systems, 2025, 31(1): 80-106.
[17] OUYANG J, WU H, WANG M, et al. Contextual similarity aggregation with self-attention for visual re-ranking[J]. Advances in Neural Information Processing Systems, 2021, 34: 3135-3148.
[18] WEI W, JIANG M, ZHANG X, et al. Boosting cross-modal retrieval with MVSE++ and reciprocal neighbors[J]. IEEE Access, 2020, 8: 84642-84651.
[19] WANG T, XU X, YANG Y, et al. Matching images and text with multi-modal tensor fusion and re-ranking[C]∥Proceedings of the 27th ACM International Conference on Multimedia. Nice: ACM, 2019: 12-20.