Journal of South China University of Technology (Natural Science Edition) ›› 2022, Vol. 50 ›› Issue (6): 1-9. doi: 10.12141/j.issn.1000-565X.210709

Special Topic: Computer Science and Technology (2022)

• Computer Science and Technology •

A cross-modal face retrieval method based on metric learning

WO Yan, LIANG Jiyun, HAN Guoqiang

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2021-11-09 Revised: 2021-12-31 Online: 2022-06-25 Published: 2022-02-11
  • Contact: WO Yan (b. 1975), female, Ph.D., professor, whose research focuses on multimedia application technology. E-mail: woyan@scut.edu.cn
  • About author: WO Yan (b. 1975), female, Ph.D., professor, whose research focuses on multimedia application technology.
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2021A1515012020) and the Science and Technology Program of Guangzhou (202002030298)



Abstract: Metric learning is an important technique for reducing modality differences. When applied to cross-modal face retrieval, existing metric-learning-based cross-modal retrieval methods pay little attention to pose differences and domain differences, and their metric learning suffers from two problems: a lack of global information and a large number of redundant triplets. This paper proposes a cross-modal common representation generation algorithm based on metric learning. A yaw-angle equivariant module compensates for yaw-angle differences to obtain robust image features, and a multi-layer attention mechanism extracts discriminative video features. Global triplets and local triplets are combined to jointly train the cross-modal common representation generation network, improving the consistency and accuracy of metric learning, while semi-hard triplet screening accelerates the convergence of the loss function. Domain calibration is further combined with transfer learning as a domain adaptation method to improve the generalization of the common representations. Finally, experiments on three face video datasets (PB, YTC and UMD Faces) demonstrate that the proposed algorithm improves the accuracy of cross-modal face retrieval, and that fine-tuning the cross-modal common representation generation network with a small number of samples improves the accuracy of cross-modal retrieval for target-domain images.
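The semi-hard triplet screening mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the squared-Euclidean distance choice, and the margin value are assumptions; the sketch only shows the standard rule that a negative is kept when it lies farther from the anchor than the positive but still inside the margin, so easy (redundant) triplets contribute no loss and are skipped.

```python
import numpy as np

def semi_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Triplet loss with semi-hard negative mining (illustrative sketch).

    A negative n is semi-hard for an anchor-positive pair (a, p) when
    d(a, p) < d(a, n) < d(a, p) + margin: it is farther than the
    positive yet still violates the margin, giving a non-zero but
    stable gradient, while easy (redundant) triplets are discarded.
    """
    # Pairwise Euclidean distances between all embeddings in the batch.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)

    losses = []
    n = len(labels)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # p must be a distinct sample with the anchor's label
            d_ap = dist[a, p]
            # Semi-hard negatives: farther than the positive, inside the margin.
            neg = [dist[a, k] for k in range(n)
                   if labels[k] != labels[a] and d_ap < dist[a, k] < d_ap + margin]
            if neg:
                # Hardest (closest) semi-hard negative drives the loss term.
                losses.append(d_ap - min(neg) + margin)
    return float(np.mean(losses)) if losses else 0.0
```

In a full pipeline the same rule would be applied separately to the global triplets (whole-feature distances) and local triplets that the paper combines; here a single embedding matrix stands in for both.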

Key words: metric learning, cross-modal retrieval, attention mechanism, deep learning