Journal of South China University of Technology (Natural Science Edition) ›› 2023, Vol. 51 ›› Issue (5): 70-77. doi: 10.12141/j.issn.1000-565X.220435

Special Issue: 2023 Electronics, Communication and Automatic Control

• Electronics, Communication & Automation Technology •

Multi-View Lip Motion and Voice Consistency Judgment Based on Lip Reconstruction and Three-Dimensional Coupled CNN

ZHU Zhengyu1, LUO Chao, HE Qianhua, PENG Weifeng, MAO Zhiwei, ZHANG Shunsi3

  1. Audio, Speech and Vision Processing Laboratory, South China University of Technology, Guangzhou 510640, Guangdong, China
  2. School of Cyber Security, Guangdong Polytechnic Normal University, Guangzhou 510665, Guangdong, China
  3. Guangzhou Quwan Network Technology Co., Ltd., Guangzhou 510665, Guangdong, China
  • Received: 2022-07-08  Online: 2023-05-25  Published: 2022-10-20
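  • Contact: PENG Weifeng (born 1976), male, Ph.D., lecturer, mainly engaged in research on speech signal processing. E-mail: pengweifeng0215@163.com
  • About author: ZHU Zhengyu (born 1984), male, postdoctoral researcher, lecturer, mainly engaged in research on audio-visual multimodal signal processing. E-mail: zhuzhengyu0701@163.com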
  • Supported by:
    the National Natural Science Foundation of China (61672173); the National Key R&D Program of China (2018YFB1802200)

Abstract:

Traditional methods for judging the consistency of lip motion and voice mainly process frontal lip motion video and do not consider how changes in viewing angle during video acquisition affect the result. They also tend to ignore the spatio-temporal characteristics of the lip movement process. To address these problems, this paper focuses on the influence of lip angle changes on consistency judgment and exploits the strengths of three-dimensional (3D) convolutional neural networks in non-linear representation and spatio-temporal feature extraction, proposing a multi-view lip motion and voice consistency judgment method based on frontal lip reconstruction and a 3D coupled convolutional neural network. Firstly, a self-mapping loss is introduced into the generator to improve frontal reconstruction, and a lip reconstruction method based on a self-mapping supervised cycle-consistent generative adversarial network (SMS-CycleGAN) is used for angle classification and frontal reconstruction of multi-view lip images. Secondly, two heterogeneous 3D convolutional neural networks are designed to describe the audio and video signals respectively, and 3D convolutional features containing long-term spatio-temporal correlation information are extracted. Finally, a contrastive loss function is introduced as the correlation measure for audio-video matching, and the outputs of the audio and video networks are coupled into the same representation space for consistency judgment. Experimental results show that the proposed method reconstructs frontal lip images of higher quality and outperforms a variety of comparison methods in consistency judgment.
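
The final step of the abstract, coupling the audio and video network outputs into one representation space under a contrastive loss, can be illustrated with a minimal sketch. It assumes the commonly used margin-based form of the contrastive loss with Euclidean distance; the abstract does not state the exact formula, the margin, or the decision threshold, so the function names, `margin`, and `threshold` below are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance between two embeddings in the shared representation space."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(audio_emb: Sequence[float],
                     video_emb: Sequence[float],
                     is_match: bool,
                     margin: float = 1.0) -> float:
    """Margin-based contrastive loss for one audio-video pair.

    Assumed form (not confirmed by the abstract):
      matching pair:     d^2
      non-matching pair: max(0, margin - d)^2
    Matching pairs are pulled together; non-matching pairs are pushed
    apart until they are at least `margin` away from each other.
    """
    d = euclidean_distance(audio_emb, video_emb)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


def is_consistent(audio_emb: Sequence[float],
                  video_emb: Sequence[float],
                  threshold: float = 0.5) -> bool:
    """Judge a pair as consistent when its embeddings are close enough.

    `threshold` is a hypothetical value; in practice it would be tuned
    on validation data.
    """
    return euclidean_distance(audio_emb, video_emb) < threshold
```

In this sketch the embeddings stand in for the outputs of the two 3D convolutional networks. At test time only the distance is needed, so a pair is accepted or rejected by comparing it with the threshold.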

Key words: consistency judgment, generative adversarial network, convolutional neural network, frontal reconstruction, multi-modal
