Journal of South China University of Technology(Natural Science Edition) ›› 2022, Vol. 50 ›› Issue (6): 1-9.doi: 10.12141/j.issn.1000-565X.210709

Special Issue: 2022 Computer Science & Technology

• Computer Science & Technology •

A cross-modal face retrieval method based on metric learning

WO Yan, LIANG Jiyun, HAN Guoqiang

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2021-11-09 Revised: 2021-12-31 Online: 2022-06-25 Published: 2022-02-11
  • Contact: WO Yan (b. 1975), female, Ph.D., professor; her research focuses on multimedia application technology. E-mail: woyan@scut.edu.cn
  • About author: WO Yan (b. 1975), female, Ph.D., professor; her research focuses on multimedia application technology.
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2021A1515012020)

Abstract: Metric learning is an important technique for reducing modal differences. However, existing metric-learning-based cross-modal retrieval methods pay little attention to pose and domain differences in cross-modal face retrieval tasks, and their metric learning suffers from two problems: global information is not learned, and a large number of redundant triplets are generated. This paper proposes a cross-modal common representation generation algorithm based on metric learning. The algorithm uses a yaw-angle equivariant module to compensate for yaw-angle differences and thereby obtain robust image features; applies a multi-layer attention mechanism to obtain discriminative video features; combines global and local triplets to jointly train the cross-modal common representation generation network, accelerating the convergence of the loss function by screening semi-hard triplets; and combines domain calibration with transfer learning to improve the generalization ability of the common representations. Comparison experiments on three face video datasets (PB, YTC and UMD Faces) show that the proposed algorithm improves the accuracy of cross-modal face retrieval, and fine-tuning the cross-modal common representation generation network with different numbers of target-domain samples shows that it also improves the accuracy of cross-modal retrieval of target-domain images.
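The semi-hard triplet screening mentioned in the abstract is, in the standard formulation (as popularized by FaceNet-style training), the selection of triplets whose negative lies farther from the anchor than the positive, but still within the margin. The sketch below is a minimal NumPy illustration of that criterion and the resulting triplet loss; it is not the paper's implementation, and the function names, margin value, and toy embeddings are illustrative assumptions only.

```python
import numpy as np

def pairwise_sq_dist(X):
    """Squared Euclidean distances between all rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(d, 0.0)  # clamp tiny negatives from round-off

def semi_hard_triplets(X, labels, margin=0.2):
    """Return (anchor, positive, negative) index triplets that are
    semi-hard: d(a, p) < d(a, n) < d(a, p) + margin."""
    D = pairwise_sq_dist(X)
    triplets = []
    n = len(labels)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positive must share the anchor's identity
            d_ap = D[a, p]
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # negative must be a different identity
                d_an = D[a, neg]
                if d_ap < d_an < d_ap + margin:  # semi-hard band
                    triplets.append((a, p, neg))
    return triplets

def triplet_loss(X, triplets, margin=0.2):
    """Mean hinge loss max(0, d(a,p) - d(a,n) + margin) over triplets."""
    D = pairwise_sq_dist(X)
    losses = [max(0.0, D[a, p] - D[a, n] + margin) for a, p, n in triplets]
    return float(np.mean(losses)) if losses else 0.0
```

Screening in this way discards easy triplets (already satisfying the margin, gradient zero) and the hardest ones (which can destabilize early training), which is why it speeds up convergence of the loss.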

Key words: metric learning, cross-modal retrieval, attention mechanism, deep learning