A Cross-Modal Face Retrieval Algorithm Based on Metric Learning

WO Yan (沃焱); LIANG Ji-Yun (梁籍云); HAN Guo-Qiang (韩国强)

doi:10.12141/j.issn.1000-565X.210709

Journal of South China University of Technology (Natural Science Edition)

2022, Vol. 50, Issue 6: 1-9


Computer Science and Technology

School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China

WO Yan (born 1975), female, Ph.D., professor; her research focuses on multimedia application technology.

Received date: 2021-11-09

Revised date: 2021-12-31

Online published: 2022-02-11

Supported by

Natural Science Foundation of Guangdong Province (2021A1515012020); Guangzhou Science and Technology Plan Project (202002030298)

Cite this article

WO Yan, LIANG Ji-Yun, HAN Guo-Qiang. A Cross-Modal Face Retrieval Algorithm Based on Metric Learning[J]. Journal of South China University of Technology (Natural Science Edition), 2022, 50(6): 1-9. DOI: 10.12141/j.issn.1000-565X.210709

Abstract

Metric learning is an important technique for reducing modality differences. When applied to cross-modal face retrieval, existing cross-modal retrieval methods based on metric learning pay little attention to pose differences and domain differences, and their metric learning suffers from two problems: global information is not learned, and a large number of redundant triplets are present. This paper proposes a cross-modal common representation generation algorithm based on metric learning. A yaw angle equivariant module compensates for yaw angle differences to obtain robust image features, and a multi-layer attention mechanism extracts discriminative video features. Global triplets and local triplets are combined to jointly train the cross-modal common representation generation network, which improves the consistency and accuracy of metric learning, while semi-hard triplet screening accelerates the convergence of the loss function. Domain calibration and transfer learning are combined as a domain adaptation method to improve the generalization of the common representations. Experiments on three face video datasets (PB, YTC and UMD Faces) show that the proposed algorithm effectively improves the accuracy of cross-modal face retrieval, and that fine-tuning the cross-modal common representation generation network with a small number of samples effectively improves the accuracy of cross-modal retrieval for target-domain images.

Key words: metric learning; cross-modal retrieval; attention mechanism; deep learning
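The abstract mentions screening semi-hard triplets to speed up convergence of the loss function but does not give the selection rule. The following is a minimal sketch, assuming the commonly used definition in which a negative is semi-hard when it lies farther from the anchor than the positive but still inside the margin. The function names, the Euclidean distance, and the margin value are illustrative assumptions, not the authors' implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def is_semi_hard(anchor, positive, negative, margin=0.5):
    """Assumed criterion: d(a, p) < d(a, n) < d(a, p) + margin."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return d_ap < d_an < d_ap + margin


def select_semi_hard(anchor, positive, negatives, margin=0.5):
    """Keep only the negatives that form a semi-hard triplet."""
    return [n for n in negatives if is_semi_hard(anchor, positive, n, margin)]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: one hard, one semi-hard, one easy negative.
    anchor, positive = [0.0, 0.0], [1.0, 0.0]
    negatives = [[0.5, 0.0], [1.25, 0.0], [3.0, 0.0]]
    for neg in select_semi_hard(anchor, positive, negatives):
        print(neg, triplet_loss(anchor, positive, neg))
```

Under this assumed rule, easy negatives (loss already zero) are dropped, which matches the abstract's point about removing redundant triplets. Hard negatives, which sit closer to the anchor than the positive, are also dropped.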
