Journal of South China University of Technology (Natural Science Edition) ›› 2023, Vol. 51 ›› Issue (5): 70-77. doi: 10.12141/j.issn.1000-565X.220435

Special topic: 2023 Electronics, Communication and Automatic Control

• Electronics, Communication and Automatic Control •

Multi-View Lip Motion and Voice Consistency Judgment Based on Lip Reconstruction and Three-Dimensional Coupled CNN

ZHU Zhengyu1,2, LUO Chao2, HE Qianhua1, PENG Weifeng2, MAO Zhiwei2, ZHANG Shunsi3

  1. Audio, Speech and Vision Processing Laboratory, South China University of Technology, Guangzhou 510640, Guangdong, China
  2. School of Cyber Security, Guangdong Polytechnic Normal University, Guangzhou 510665, Guangdong, China
  3. Guangzhou Quwan Network Technology Co., Ltd., Guangzhou 510665, Guangdong, China
  • Received: 2022-07-08  Published: 2023-05-25  Released online: 2022-10-20
  • Contact: PENG Weifeng (born 1976), male, Ph.D., lecturer, mainly engaged in research on speech signal processing. E-mail: pengweifeng0215@163.com
  • About author: ZHU Zhengyu (born 1984), male, postdoctoral fellow, lecturer, mainly engaged in research on audio-visual multimodal signal processing. E-mail: zhuzhengyu0701@163.com
  • Supported by:
    the National Natural Science Foundation of China (61672173); the National Key R&D Program of China (2018YFB1802200)

Abstract:

Traditional methods for judging the consistency of lip motion and voice mainly process frontal lip motion video. They do not consider how changes in the video acquisition angle affect the result, and they tend to ignore the spatio-temporal characteristics of lip movement. To address these problems, this paper focuses on the influence of lip angle changes on consistency judgment. Exploiting the strengths of three-dimensional (3D) convolutional neural networks in non-linear representation and spatio-temporal feature extraction, it proposes a multi-view lip motion and voice consistency judgment method based on frontal lip reconstruction and a 3D coupled convolutional neural network. First, a self-mapping loss is introduced into the generator to improve frontal reconstruction, and a lip reconstruction method based on a self-mapping supervised cycle-consistent generative adversarial network (SMS-CycleGAN) performs angle classification and frontal reconstruction of multi-view lip images. Second, two heterogeneous 3D convolutional neural networks are designed to describe the audio and video signals respectively and to extract 3D convolutional features that contain long-term spatio-temporal correlation information. Finally, a contrastive loss function is introduced as the correlation discrimination measure for audio-video matching, and the outputs of the audio and video networks are coupled into the same representation space for consistency judgment. The experimental results show that the proposed method reconstructs frontal lip images of higher quality and outperforms a variety of comparison methods of different types in consistency judgment.

Key words: consistency judgment, generative adversarial network, convolutional neural network, frontal reconstruction, multi-modal
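The abstract names a contrastive loss as the matching measure but does not give its formula. The following is a minimal sketch of the common margin-based form of contrastive loss applied to paired audio and video embeddings, not the authors' implementation. The function names `contrastive_loss` and `judge_consistency`, the Euclidean distance, and the `margin` and `threshold` values are all assumptions made for illustration.

```python
import numpy as np


def contrastive_loss(audio_emb, video_emb, is_match, margin=1.0):
    """Margin-based contrastive loss over a batch of embedding pairs.

    audio_emb, video_emb: arrays of shape (batch, dim), assumed to be the
        outputs of the audio and video networks in a shared space.
    is_match: array of shape (batch,), 1 for consistent pairs, 0 otherwise.
    margin: hypothetical margin; the abstract does not state a value.
    """
    audio_emb = np.asarray(audio_emb, dtype=float)
    video_emb = np.asarray(video_emb, dtype=float)
    is_match = np.asarray(is_match, dtype=float)

    # Euclidean distance between each audio/video embedding pair.
    dist = np.linalg.norm(audio_emb - video_emb, axis=1)

    # Consistent pairs are pulled together (penalise any distance).
    pull = is_match * dist ** 2
    # Inconsistent pairs are pushed apart until they exceed the margin.
    push = (1.0 - is_match) * np.maximum(margin - dist, 0.0) ** 2

    return float(np.mean(0.5 * (pull + push)))


def judge_consistency(audio_emb, video_emb, threshold=0.5):
    """Label a pair as consistent when its embedding distance is small.

    threshold is an illustrative value, not one taken from the paper.
    """
    audio_emb = np.asarray(audio_emb, dtype=float)
    video_emb = np.asarray(video_emb, dtype=float)
    dist = np.linalg.norm(audio_emb - video_emb, axis=1)
    return dist < threshold


if __name__ == "__main__":
    audio = [[0.0, 0.0], [0.0, 0.0]]
    video = [[0.0, 0.0], [3.0, 4.0]]
    print(contrastive_loss(audio, video, [1, 0]))
    print(judge_consistency(audio, video))
```

In this form, training lowers the distance for matched audio-video pairs and raises it for mismatched ones, so a simple distance threshold can serve as the consistency decision at test time.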

CLC Number: