华南理工大学学报(自然科学版) ›› 2024, Vol. 52 ›› Issue (10): 41-50.doi: 10.12141/j.issn.1000-565X.230673

• 计算机科学与技术 •

基于语义-视觉一致性约束的零样本图像语义分割网络

陈琼, 冯媛, 李志群, 杨咏

  1. 华南理工大学 计算机科学与工程学院,广东 广州 510006
  • 收稿日期:2023-10-29 出版日期:2024-10-25 发布日期:2023-12-27
  • 作者简介:陈琼(1966—),女,博士,副教授,主要从事机器学习、不平衡分类、图像分类与分割、深度强化学习研究。E-mail: csqchen@scut.edu.cn
  • 基金资助:
    国家自然科学基金资助项目(62176095)

Semantic-Visual Consistency Constraint Network for Zero-Shot Image Semantic Segmentation

CHEN Qiong, FENG Yuan, LI Zhiqun, YANG Yong

  1. School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,Guangdong,China
  • Received:2023-10-29 Online:2024-10-25 Published:2023-12-27
  • About author: CHEN Qiong (b. 1966), female, Ph.D., associate professor. Her research focuses on machine learning, imbalanced classification, image classification and segmentation, and deep reinforcement learning. E-mail: csqchen@scut.edu.cn
  • Supported by:
    the National Natural Science Foundation of China(62176095)

摘要:

零样本图像语义分割是零样本学习在视觉领域的重要任务之一,旨在分割训练中未见的新类别。目前基于像素级视觉特征生成的方法合成的视觉特征分布和真实的视觉特征分布存在不一致性的问题,合成的视觉特征难以准确反映类语义信息,导致合成的视觉特征缺乏鉴别性;现有的一些视觉特征生成方法为了得到语义特征所表达的区分性信息,需要消耗巨大的计算资源。为此,文中提出了一种基于语义-视觉一致性约束的零样本图像语义分割网络(SVCCNet)。该网络通过语义-视觉一致性约束模块对语义特征与视觉特征进行相互转换,以提高两者的关联度,减小真实视觉特征与合成视觉特征空间结构的差异性,从而缓解合成视觉特征与真实视觉特征分布不一致的问题。语义-视觉一致性约束模块通过两个相互约束的重建映射,实现了视觉特征与类别语义的对应关系,同时保持了较低的模型复杂度。在PASCAL-VOC及PASCAL-Context数据集上的实验结果表明,SVCCNet的像素准确率、平均准确率、平均交并比、调和交并比均优于比较的主流方法。

关键词: 语义分割, 特征生成, 零样本学习, 计算机视觉, 深度学习

Abstract:

Zero-shot image semantic segmentation is one of the important tasks of zero-shot learning in the visual domain, aiming to segment novel categories unseen during training. In current methods based on pixel-level visual feature generation, the distribution of the synthesized visual features is inconsistent with that of the real visual features, and the synthesized features fail to accurately reflect class semantic information, leaving them with poor discriminability. Moreover, some existing generative methods consume substantial computational resources to obtain the discriminative information conveyed by semantic features. To address these problems, this paper proposes a zero-shot image semantic segmentation network based on semantic-visual consistency constraints, named SVCCNet. SVCCNet uses a semantic-visual consistency constraint module to transform semantic features and visual features into each other, strengthening their correlation and reducing the discrepancy between the spatial structures of real and synthesized visual features, which mitigates the inconsistency between the distributions of synthesized and real visual features. The semantic-visual consistency constraint module establishes the correspondence between visual features and class semantics through two mutually constrained reconstruction mappings while maintaining low model complexity. Experimental results on the PASCAL-VOC and PASCAL-Context datasets demonstrate that SVCCNet outperforms mainstream methods in pixel accuracy, mean accuracy, mean intersection over union (mIoU), and harmonic IoU.
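The core idea of the two mutually constrained reconstruction mappings can be illustrated with a minimal NumPy sketch: a semantic-to-visual map and a visual-to-semantic map are jointly penalized so that each round trip reconstructs its input. This is not the authors' code; the linear form of the mappings, the dimensions, and all names are illustrative assumptions, whereas SVCCNet uses learned deep mappings inside a generative segmentation network.

```python
# Illustrative sketch (assumed, not the paper's implementation) of two
# mutually constrained reconstruction mappings between class-semantic
# vectors and pixel-level visual features.
import numpy as np

rng = np.random.default_rng(0)
d_sem, d_vis = 300, 512          # assumed semantic and visual feature dims

W_sv = rng.normal(scale=0.01, size=(d_vis, d_sem))  # semantic -> visual
W_vs = rng.normal(scale=0.01, size=(d_sem, d_vis))  # visual -> semantic

def consistency_loss(s, v):
    """Sum of the two reconstruction errors that tie the spaces together."""
    s_rec = W_vs @ (W_sv @ s)    # semantic -> visual -> semantic round trip
    v_rec = W_sv @ (W_vs @ v)    # visual -> semantic -> visual round trip
    return np.mean((s_rec - s) ** 2) + np.mean((v_rec - v) ** 2)

s = rng.normal(size=d_sem)       # a class-semantic embedding (e.g. word vector)
v = rng.normal(size=d_vis)       # a real pixel-level visual feature
loss = consistency_loss(s, v)
print(loss)
```

Minimizing such a loss jointly over both mappings encourages synthesized visual features to stay consistent with class semantics without requiring a heavy auxiliary network, which matches the abstract's claim of low model complexity.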

Key words: semantic segmentation, feature generation, zero-shot learning, computer vision, deep learning
