Journal of South China University of Technology (Natural Science Edition) ›› 2023, Vol. 51 ›› Issue (5): 13-23. doi: 10.12141/j.issn.1000-565X.220684

Special topic: Computer Science and Technology (2023)

• Computer Science and Technology •

Contrastive Knowledge Distillation Method Based on Feature Space Embedding

YE Feng, CHEN Biao, LAI Yizong

  1. School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2022-10-24  Online: 2023-05-25  Published: 2023-01-16
  • Corresponding author: YE Feng (b. 1972), male, Ph.D., associate professor; his research focuses on machine vision and sensing and control of mobile robots. E-mail: mefengye@scut.edu.cn
  • About the author: YE Feng (b. 1972), male, Ph.D., associate professor; his research focuses on machine vision and sensing and control of mobile robots.
  • Supported by:
    the Key-Area Research and Development Program of Guangdong Province (2021B0101420003)

Contrastive Knowledge Distillation Method Based on Feature Space Embedding

YE Feng, CHEN Biao, LAI Yizong

  1. School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received:2022-10-24 Online:2023-05-25 Published:2023-01-16
  • Contact: YE Feng (b. 1972), male, Ph.D., associate professor; his research focuses on machine vision and sensing and control of mobile robots. E-mail: mefengye@scut.edu.cn
  • About author: YE Feng (b. 1972), male, Ph.D., associate professor; his research focuses on machine vision and sensing and control of mobile robots.
  • Supported by:
    the Key-Area R&D Program of Guangdong Province (2021B0101420003)

Abstract:

Because it can effectively compress convolutional neural network models, knowledge distillation has attracted considerable attention in the field of deep learning. However, when transferring knowledge, the classical knowledge distillation algorithm uses only the information of individual samples and ignores the importance of inter-sample relations, so its performance is unsatisfactory. To improve the efficiency and performance of knowledge transfer in knowledge distillation, this paper proposes a feature-space-embedding based contrastive knowledge distillation (FSECD) algorithm. The algorithm adopts an in-batch construction strategy that embeds the output features of the student model into the feature space of the teacher model, so that each student feature forms N contrastive pairs with the N features output by the teacher model. In each pair, the teacher feature is already optimized and fixed, while the student feature is yet to be optimized and tunable. During training, FSECD narrows the distance of positive pairs and enlarges the distance of negative pairs, so that the student model can perceive and learn the inter-sample relations among the teacher's output features, thereby transferring knowledge from the teacher model to the student model. Experimental results on the CIFAR-100 and ImageNet datasets with various teacher-student architectures show that, compared with other mainstream distillation algorithms, FSECD achieves significant performance improvements without requiring additional network structures or data, which further demonstrates the importance of inter-sample relations in knowledge distillation.
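As an illustration only (not the authors' released implementation), the following PyTorch-style sketch shows one plausible way to realize the in-batch contrastive pairing described above: each student feature is projected into the teacher feature space and compared with all N teacher features of the batch, with the same-sample pair treated as positive and the rest as negatives. The linear projection, cosine similarity, and temperature value are assumptions introduced here for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BatchContrastiveKDLoss(nn.Module):
    """Illustrative batch-wise contrastive distillation loss (sketch).

    For a batch of N samples, each student feature forms N contrastive
    pairs with the N teacher features: the pair sharing the same sample
    index is positive, the other N-1 pairs are negative. Teacher features
    are detached (fixed); only the student side receives gradients.
    """

    def __init__(self, student_dim, teacher_dim, temperature=0.1):
        super().__init__()
        # Assumed linear projection embedding student features into the
        # teacher feature space so the two can be compared directly.
        self.embed = nn.Linear(student_dim, teacher_dim)
        self.temperature = temperature

    def forward(self, student_feat, teacher_feat):
        # student_feat: (N, student_dim), teacher_feat: (N, teacher_dim)
        s = F.normalize(self.embed(student_feat), dim=1)
        t = F.normalize(teacher_feat.detach(), dim=1)

        # (N, N) cosine-similarity matrix: row i holds the N contrastive
        # pairs built for the i-th student feature.
        logits = s @ t.t() / self.temperature

        # Diagonal entries are the positive pairs; cross-entropy pulls
        # them together and pushes the off-diagonal (negative) pairs apart.
        targets = torch.arange(s.size(0), device=s.device)
        return F.cross_entropy(logits, targets)
```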

Keywords: image classification, knowledge distillation, convolutional neural network, deep learning, contrastive learning

Abstract:

Because of its important role in model compression, knowledge distillation has attracted much attention in the field of deep learning. However, the classical knowledge distillation algorithm uses only the information of a single sample and neglects the importance of the relationship between samples, leading to poor performance. To improve the efficiency and performance of knowledge transfer in knowledge distillation, this paper proposes a feature-space-embedding based contrastive knowledge distillation (FSECD) algorithm. The algorithm adopts an efficient in-batch construction strategy, which embeds the student feature into the teacher feature space so that each student feature builds N contrastive pairs with the N teacher features. In each pair, the teacher feature is optimized and fixed, while the student feature is tunable and to be optimized. In the training process, the distance for positive pairs is narrowed and the distance for negative pairs is expanded, so that the student model can perceive and learn the inter-sample relations of the teacher model, realizing the transfer of knowledge from the teacher model to the student model. Extensive experiments with different teacher/student architecture settings on the CIFAR-100 and ImageNet datasets show that the FSECD algorithm achieves significant performance improvement without additional network structures or data when compared with other cutting-edge distillation methods, which further proves the importance of inter-sample relations in knowledge distillation.
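For context, the hedged sketch below shows how a contrastive distillation term of this kind might be combined with the ordinary classification loss in a single training step. The names `forward_features`, `classifier`, and the weighting factor `beta` are hypothetical placeholders introduced for illustration, not settings taken from the paper.

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, kd_loss, optimizer, images, labels, beta=1.0):
    """One illustrative training step: task loss + contrastive distillation term.

    Assumes `kd_loss` is an instance of the BatchContrastiveKDLoss sketch above,
    whose projection-layer parameters are also registered with `optimizer`.
    """
    teacher.eval()
    with torch.no_grad():
        # Penultimate-layer features of the frozen, already-optimized teacher
        # (hypothetical feature hook).
        t_feat = teacher.forward_features(images)

    s_feat = student.forward_features(images)   # hypothetical feature hook
    logits = student.classifier(s_feat)         # hypothetical classifier head

    # Standard cross-entropy task loss plus the batch-wise contrastive term.
    loss = F.cross_entropy(logits, labels) + beta * kd_loss(s_feat, t_feat)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```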

Key words: image classification, knowledge distillation, convolutional neural network, deep learning, contrastive learning

CLC number: