Computer Science and Technology


Contrastive Learning Model Based on Text-Visual and Information Entropy Minimization

  • CAI Xiaodong ,
  • DONG Lifang ,
  • HUANG Yeyang ,
  • ZHOU Li
  • 1.School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, Guangxi, China
    2.Nanning West Bund Fenggu Business Data Co., Ltd., Nanning 530008, Guangxi, China
CAI Xiaodong (1971—), male, Ph.D., research fellow, mainly engaged in artificial intelligence research. E-mail: caixiaodong@guet.edu.cn

Received date: 2024-04-07

  Online published: 2024-09-13

Supported by

the Guangxi Innovation-Driven Development Project(AA20302001)


Cite this article

CAI Xiaodong, DONG Lifang, HUANG Yeyang, ZHOU Li. Contrastive learning model based on text-visual and information entropy minimization[J]. Journal of South China University of Technology (Natural Science Edition), 2025, 53(3): 50-56. DOI: 10.12141/j.issn.1000-565X.240159

Abstract

Current unsupervised contrastive learning methods rely mainly on textual information to construct sentence embeddings, which limits how fully they can capture the deeper meaning a sentence conveys. Moreover, traditional contrastive learning focuses on maximizing the mutual information between positive text instances while overlooking the noise that sentence embeddings may carry. To retain the useful information in the text while suppressing noise in the embeddings, this paper proposes a contrastive learning model based on text-visual fusion and information entropy minimization. First, the text and its corresponding visual information are deeply fused within a contrastive learning framework and jointly mapped into a unified grounding space in which their representations are kept consistent; this overcomes the limitation of learning sentence embeddings from text alone and makes the contrastive learning process more comprehensive and precise. Second, following the information-minimization principle, the model reconstructs positive text instances via information entropy minimization while maximizing the mutual information between them. Experimental results on the standard semantic textual similarity (STS) task show that the proposed model achieves significant gains in the Spearman correlation coefficient over existing state-of-the-art methods, confirming its effectiveness.
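The two ingredients described in the abstract — a cross-modal contrastive (InfoNCE-style) objective over a shared grounding space, and an entropy-minimization term on the embeddings, evaluated by Spearman correlation — can be sketched in numpy. This is an illustrative toy, not the authors' implementation: the random vectors, the softmax-entropy penalty, the `temperature` value, and the 0.1 penalty weight are all assumptions for demonstration, and the Spearman helper ignores ties.

```python
import numpy as np

def info_nce(text_emb, img_emb, temperature=0.05):
    """InfoNCE over matched rows: row i of each matrix is a positive pair.
    Maximizing the diagonal log-softmax maximizes a lower bound on the
    mutual information between the two views."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature                  # scaled cosine similarities
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_p)))            # cross-entropy toward the diagonal

def entropy_penalty(emb):
    """Mean Shannon entropy of each embedding's softmax distribution;
    minimizing it is one simple way to suppress noisy, high-entropy components."""
    p = np.exp(emb - emb.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return float(-np.mean((p * np.log(p + 1e-12)).sum(axis=1)))

def spearman(x, y):
    """Spearman correlation via rank transform (no tie handling, for brevity)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
image = text + 0.1 * rng.normal(size=(8, 16))   # aligned pairs, small perturbation
total = info_nce(text, image) + 0.1 * entropy_penalty(text)
print(total)
```

With aligned pairs the contrastive term is small; shuffling the image rows (`image[::-1]`) breaks the pairing and drives the loss up, which is the signal the objective trains on.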
