Multi-View Lip Motion and Voice Consistency Judgment Based on Lip Reconstruction and Three-Dimensional Coupled CNN

Journal of South China University of Technology (Natural Science Edition), 2023, Vol. 51, Issue 5: 70-77. doi: 10.12141/j.issn.1000-565X.220435

Topic collection: 2023 Electronics, Communication and Automatic Control

ZHU Zhengyu¹,², LUO Chao², HE Qianhua¹, PENG Weifeng², MAO Zhiwei², ZHANG Shunsi³

1. Audio, Speech and Vision Processing Laboratory, South China University of Technology, Guangzhou 510640, Guangdong, China
2. School of Cyber Security, Guangdong Polytechnic Normal University, Guangzhou 510665, Guangdong, China
3. Guangzhou Quwan Network Technology Co., Ltd., Guangzhou 510665, Guangdong, China

Received: 2022-07-08  Published: 2023-05-25  Online: 2022-10-20
Contact: PENG Weifeng (b. 1976), male, Ph.D., lecturer; research focus: speech signal processing. E-mail: pengweifeng0215@163.com
About the author: ZHU Zhengyu (b. 1984), male, postdoctoral researcher, lecturer; research focus: audio-visual multimodal signal processing. E-mail: zhuzhengyu0701@163.com
Supported by: the National Natural Science Foundation of China (61672173); the National Key R&D Program of China (2018YFB1802200)

Abstract:

Traditional methods for judging the consistency of lip motion and voice mainly process frontal lip-motion video. They do not consider how changes in the video capture angle affect the result, and they tend to ignore the spatio-temporal characteristics of lip movement. To address these problems, this paper focuses on the influence of lip-angle changes on consistency judgment. Exploiting the strengths of three-dimensional (3D) convolutional neural networks in non-linear representation and spatio-temporal feature extraction, it proposes a multi-view lip motion and voice consistency judgment method based on frontal lip reconstruction and a 3D coupled convolutional neural network. First, a self-mapping loss is introduced into the generator to improve frontal reconstruction, and a lip reconstruction method based on a self-mapping supervised cycle-consistent generative adversarial network (SMS-CycleGAN) performs angle classification and frontal reconstruction of multi-view lip images. Second, two heterogeneous 3D convolutional neural networks are designed to describe the audio and video signals respectively, and 3D convolutional features containing long-term spatio-temporal correlation information are extracted. Finally, a contrastive loss function is introduced as the correlation measure for audio-video matching, and the outputs of the audio and video networks are coupled into a common representation space for consistency judgment. Experimental results show that the proposed method reconstructs higher-quality frontal lip images and outperforms several different types of comparison methods in consistency judgment.

Key words: consistency judgment, generative adversarial network, convolutional neural network, frontal reconstruction, multi-modal
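As a rough illustration of the final step described in the abstract, the sketch below computes a contrastive loss over paired audio and video embeddings with NumPy. The margin value, the label convention (1 = consistent pair), and the function name `contrastive_loss` are assumptions made for illustration; the abstract does not give the paper's exact formulation.

```python
import numpy as np

def contrastive_loss(audio_emb, video_emb, labels, margin=1.0):
    """Mean contrastive loss over a batch of audio/video embedding pairs.

    labels: 1 for a consistent (matching) pair, 0 for an inconsistent pair.
    margin: assumed hyperparameter, not taken from the paper.
    """
    audio_emb = np.asarray(audio_emb, dtype=float)
    video_emb = np.asarray(video_emb, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(audio_emb - video_emb, axis=1)
    # Matching pairs are pulled together; mismatched pairs are pushed
    # apart until they are at least `margin` away from each other.
    pos_term = labels * dist ** 2
    neg_term = (1.0 - labels) * np.maximum(margin - dist, 0.0) ** 2
    return float(np.mean(pos_term + neg_term) / 2.0)
```

With a loss of this shape, the distance between the two embeddings can serve directly as the consistency score at test time: a small distance suggests that the audio and the lip motion belong together.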

CLC number: TP391

Cite this article:

ZHU Zhengyu, LUO Chao, HE Qianhua, et al. Multi-View Lip Motion and Voice Consistency Judgment Based on Lip Reconstruction and Three-Dimensional Coupled CNN[J]. Journal of South China University of Technology (Natural Science Edition), 2023, 51(5): 70-77.

Figures/Tables: 12 (Figures 1-8, Tables 1-4)

Angle/(°)	PSNR: SMS-CGAN	PSNR: CGAN	PSNR: V2V	SSIM: SMS-CGAN	SSIM: CGAN	SSIM: V2V
30	29.29	28.23	24.03	0.78	0.77	0.65
45	27.93	26.37	23.56	0.73	0.71	0.67
60	24.78	23.55	22.61	0.72	0.68	0.62
90	19.12	17.43	16.80	0.64	0.61	0.60
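PSNR, the first metric in the table above, can be computed as in the minimal NumPy sketch below. It assumes 8-bit images (peak value 255); the function name `psnr` is illustrative and does not come from the paper. SSIM is more involved and is usually taken from an image-processing library rather than written by hand.

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    # Mean squared error over all pixels.
    mse = np.mean((reference - reconstructed) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```

A higher PSNR means the reconstructed frontal lip image is closer, pixel by pixel, to the reference frontal image.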

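Type of inconsistent data	Source of the audio and lip-motion video data
Type 1	Different speakers, different sentences
Type 2	Different speakers, same sentence
Type 3	Same speaker, different sentences
Type 4	Same speaker, same sentence, but recorded at different times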
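The four categories above can be expressed as a small labeling rule. The sketch below is illustrative only: the `Clip` record and its fields (`speaker`, `sentence`, `take`) are hypothetical names, not the paper's data format.

```python
from typing import NamedTuple

class Clip(NamedTuple):
    speaker: str   # who is speaking
    sentence: str  # what is said
    take: int      # which recording the clip comes from

def pair_category(audio: Clip, video: Clip) -> str:
    """Label an audio/video pair as consistent or as one of four inconsistent types."""
    same_speaker = audio.speaker == video.speaker
    same_sentence = audio.sentence == video.sentence
    if same_speaker and same_sentence:
        # Only audio and video from the same recording are consistent.
        return "consistent" if audio.take == video.take else "type 4"
    if same_speaker:
        return "type 3"
    return "type 2" if same_sentence else "type 1"
```

Type 4 pairs are the hardest negatives under this rule, since speaker and content both match and only the timing differs.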

Angle/(°)	Overall EER/%: proposed method	Overall EER/%: AV-SISR (K=175)	Overall EER/%: STF	Overall EER/%: AV-SyncNet	Overall EER/%: QMI	Overall EER/%: BLPM	Overall AUC: proposed method	Overall AUC: AV-SISR (K=175)	Overall AUC: STF	Overall AUC: AV-SyncNet	Overall AUC: QMI	Overall AUC: BLPM
0	8.9	15.7	14.8	11.1	20.8	19.3	0.947	0.879	0.885	0.933	0.858	0.860
30	12.3	20.2	17.1	13.2	23.3	23.1	0.920	0.857	0.871	0.905	0.815	0.819
45	17.5	26.7	24.2	18.6	29.7	28.8	0.868	0.768	0.797	0.863	0.735	0.744
60	26.5	33.5	31.1	29.0	36.6	34.9	0.769	0.694	0.721	0.704	0.669	0.679
90	37.1	47.1	39.8	38.3	46.7	44.5	0.665	0.589	0.644	0.659	0.592	0.613

Angle/(°)	Overall EER/%: proposed method	Overall EER/%: AV-SISR (K=175)	Overall EER/%: STF	Overall EER/%: AV-SyncNet	Overall EER/%: QMI	Overall EER/%: BLPM	Overall AUC: proposed method	Overall AUC: AV-SISR (K=175)	Overall AUC: STF	Overall AUC: AV-SyncNet	Overall AUC: QMI	Overall AUC: BLPM
30	11.9	17.8	16.3	12.4	21.8	22.1	0.925	0.866	0.876	0.917	0.844	0.838
45	14.2	20.9	18.1	15.8	24.7	23.6	0.889	0.857	0.865	0.879	0.787	0.809
60	19.1	23.7	21.6	21.7	26.4	28.6	0.861	0.807	0.848	0.846	0.771	0.747
90	24.4	29.8	27.4	28.1	32.5	34.3	0.793	0.734	0.759	0.751	0.704	0.684
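EER and AUC values like those tabulated above are derived from per-pair consistency scores. The sketch below shows one simple way to compute both with NumPy by sweeping a decision threshold. The score convention (higher means more consistent), the function name `eer_and_auc`, and the nearest-point EER estimate without interpolation are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def eer_and_auc(scores, labels):
    """Return (EER, AUC) for scores where a higher score means 'more consistent'.

    labels: 1 for consistent pairs, 0 for inconsistent pairs.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Sweep thresholds from strict to lenient; accept when score >= threshold.
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    far = np.array([(neg >= t).mean() for t in thresholds])  # false accept rate
    frr = np.array([(pos < t).mean() for t in thresholds])   # false reject rate
    # EER: the operating point where the two error rates are closest.
    i = np.argmin(np.abs(far - frr))
    eer = (far[i] + frr[i]) / 2.0
    # AUC: area under the ROC curve (true accept rate against false accept
    # rate), integrated with the trapezoid rule.
    tpr = 1.0 - frr
    auc = np.sum(np.diff(far) * (tpr[1:] + tpr[:-1]) / 2.0)
    return float(eer), float(auc)
```

A lower EER and a higher AUC both indicate better separation between consistent and inconsistent pairs.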

