Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (3): 50-56. doi: 10.12141/j.issn.1000-565X.240159

• Computer Science & Technology •

  • Biography: CAI Xiaodong (born 1971), male, PhD, researcher, whose main research interest is artificial intelligence. E-mail: caixiaodong@guet.edu.cn

Contrastive Learning Model Based on Text-Visual and Information Entropy Minimization

CAI Xiaodong1, DONG Lifang1, HUANG Yeyang1, ZHOU Li2

  1. School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, Guangxi, China
    2. Nanning West Bund Fenggu Business Data Co., Ltd., Nanning 530008, Guangxi, China
  • Received: 2024-04-07 Online: 2025-03-10 Published: 2024-09-13
  • Supported by:
    the Guangxi Innovation-Driven Development Project (AA20302001)


Abstract:

Current unsupervised contrastive learning methods rely mainly on textual information to construct sentence embeddings, which limits their ability to fully capture the deeper meanings that sentences convey. Moreover, traditional contrastive learning methods focus on maximizing the mutual information between positive text instances while overlooking the noise latent in sentence embeddings. To retain the useful information in the text while effectively removing noise from the text embeddings, this paper proposes a contrastive learning model based on text-visual fusion and information entropy minimization. First, text and its corresponding visual information are deeply fused within the contrastive learning framework and jointly mapped into a unified grounding space, where their representations are constrained to remain consistent. This overcomes the limitation of learning sentence embeddings from text alone and makes the contrastive learning process more comprehensive and precise. Second, following the information minimization principle, the model reconstructs positive text instances under an information entropy minimization objective while maximizing the mutual information between positive text instances. Experimental results on the standard semantic textual similarity (STS) tasks show that the proposed model achieves significant improvements in the Spearman correlation coefficient, outperforming existing state-of-the-art methods and confirming its effectiveness.
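The two objectives described in the abstract can be illustrated with a minimal NumPy sketch: an InfoNCE-style loss that maximizes mutual information between paired text and visual embeddings in a shared grounding space, plus an entropy term on the text embeddings. The function names, the temperature 0.05, the weight 0.1, and the specific softmax-entropy formulation are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE contrastive loss: maximizes mutual information between
    matched (anchor, positive) pairs against in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # diagonal = matched pairs


def embedding_entropy(z):
    """Shannon entropy of softmax-normalized embedding dimensions.
    Minimizing it concentrates each embedding on fewer dimensions --
    one simple reading of 'information entropy minimization'."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))


rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))                    # text sentence embeddings
visual = text + 0.1 * rng.normal(size=(8, 16))     # paired embeddings in the grounding space
loss = info_nce(text, visual) + 0.1 * embedding_entropy(text)
```

In a real model the two embeddings would come from trained text and visual encoders and the combined loss would be backpropagated; here the perturbed copy of `text` only stands in for a visually grounded view of the same sentence.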

Key words: unsupervised contrastive learning, mutual information, text-visual, information entropy minimization, semantic textual similarity

CLC number: