Multi-View Lip Motion and Voice Consistency Judgment Based on Lip Reconstruction and Three-Dimensional Coupled CNN

doi:10.12141/j.issn.1000-565X.220435

Abstract

Traditional methods for judging the consistency of lip motion and voice mainly process frontal lip motion video and do not consider how changes in viewing angle during video acquisition affect the result. They also tend to ignore the spatio-temporal characteristics of the lip movement process. To address these problems, this paper focuses on the influence of lip angle changes on consistency judgment, exploits the strengths of three-dimensional (3D) convolutional neural networks in non-linear representation and spatio-temporal feature extraction, and proposes a multi-view lip motion and voice consistency judgment method based on frontal lip reconstruction and a 3D coupled convolutional neural network. First, a self-mapping loss is introduced into the generator to improve frontal reconstruction, and a lip reconstruction method based on the self-mapping supervised cycle-consistent generative adversarial network (SMS-CycleGAN) is used for angle classification and frontal reconstruction of multi-view lip images. Second, two heterogeneous 3D convolutional neural networks are designed to describe the audio and video signals respectively, and 3D convolutional features containing long-term spatio-temporal correlation information are extracted. Finally, a contrastive loss function is introduced as the correlation measure for audio-video matching, and the outputs of the audio and video networks are coupled into the same representation space for consistency judgment. Experimental results show that the proposed method reconstructs frontal lip images of higher quality and outperforms a variety of comparison methods in consistency judgment.
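The coupling step described in the abstract can be illustrated with a minimal sketch of a contrastive loss over one audio embedding and one video embedding. The abstract does not give the exact formula, so this assumes the common margin-based form (squared distance for consistent pairs, squared hinge on the margin for inconsistent pairs). The function names, the `margin` value, and the decision `threshold` are hypothetical choices for illustration, not the paper's settings.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(audio_emb, video_emb, is_consistent, margin=1.0):
    """Margin-based contrastive loss for one audio/video pair (assumed form).

    Consistent pairs are pulled together (loss = d^2); inconsistent pairs
    are pushed apart until they are at least `margin` away
    (loss = max(margin - d, 0)^2).
    """
    d = euclidean_distance(audio_emb, video_emb)
    if is_consistent:
        return d ** 2
    return max(margin - d, 0.0) ** 2


def judge_consistent(audio_emb, video_emb, threshold=0.5):
    """Hypothetical decision rule: small distance means consistent."""
    return euclidean_distance(audio_emb, video_emb) < threshold
```

In a setup like this, both networks would be trained jointly so that their outputs live in one shared space, and a distance threshold at test time would turn the learned distance into a consistent/inconsistent decision.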

Key words: consistency judgment, generative adversarial network, convolutional neural network, frontal reconstruction, multi-modal

CLC Number:

TP391

ZHU Zhengyu, LUO Chao, HE Qianhua, et al. Multi-View Lip Motion and Voice Consistency Judgment Based on Lip Reconstruction and Three-Dimensional Coupled CNN[J]. Journal of South China University of Technology(Natural Science Edition), 2023, 51(5): 70-77.


Angle/(°)	PSNR (SMS-CGAN)	PSNR (CGAN)	PSNR (V2V)	SSIM (SMS-CGAN)	SSIM (CGAN)	SSIM (V2V)
30	29.29	28.23	24.03	0.78	0.77	0.65
45	27.93	26.37	23.56	0.73	0.71	0.67
60	24.78	23.55	22.61	0.72	0.68	0.62
90	19.12	17.43	16.80	0.64	0.61	0.60
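For readers unfamiliar with the PSNR column in the table above, the following is a minimal sketch of the standard definition, PSNR = 10·log10(MAX² / MSE), applied to two grayscale images given as nested lists. The function name and the 8-bit `max_value` default are assumptions for illustration; the source does not state how the reported values were computed.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio between two same-sized grayscale images.

    Images are nested lists of pixel intensities (rows of values).
    Returns math.inf when the images are identical.
    """
    ref = [p for row in reference for p in row]
    rec = [p for row in reconstructed for p in row]
    if not ref or len(ref) != len(rec):
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value indicates a reconstruction closer to the reference frontal image, which is why the scores in the table fall as the viewing angle grows.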

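Type of inconsistent data	Source of the audio and lip-motion video data
Type 1	Different speakers, and the content is not the same sentence
Type 2	Different speakers, but the content is the same sentence
Type 3	Same speaker, but the content is not the same sentence
Type 4	Same speaker and same sentence, but not recorded at the same time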

Angle/(°)	Overall EER/% (Proposed)	Overall EER/% (AV-SISR, K=175)	Overall EER/% (STF)	Overall EER/% (AV-SyncNet)	Overall EER/% (QMI)	Overall EER/% (BLPM)	Overall AUC (Proposed)	Overall AUC (AV-SISR, K=175)	Overall AUC (STF)	Overall AUC (AV-SyncNet)	Overall AUC (QMI)	Overall AUC (BLPM)
0	8.9	15.7	14.8	11.1	20.8	19.3	0.947	0.879	0.885	0.933	0.858	0.860
30	12.3	20.2	17.1	13.2	23.3	23.1	0.920	0.857	0.871	0.905	0.815	0.819
45	17.5	26.7	24.2	18.6	29.7	28.8	0.868	0.768	0.797	0.863	0.735	0.744
60	26.5	33.5	31.1	29.0	36.6	34.9	0.769	0.694	0.721	0.704	0.669	0.679
90	37.1	47.1	39.8	38.3	46.7	44.5	0.665	0.589	0.644	0.659	0.592	0.613

Angle/(°)	Overall EER/% (Proposed)	Overall EER/% (AV-SISR, K=175)	Overall EER/% (STF)	Overall EER/% (AV-SyncNet)	Overall EER/% (QMI)	Overall EER/% (BLPM)	Overall AUC (Proposed)	Overall AUC (AV-SISR, K=175)	Overall AUC (STF)	Overall AUC (AV-SyncNet)	Overall AUC (QMI)	Overall AUC (BLPM)
30	11.9	17.8	16.3	12.4	21.8	22.1	0.925	0.866	0.876	0.917	0.844	0.838
45	14.2	20.9	18.1	15.8	24.7	23.6	0.889	0.857	0.865	0.879	0.787	0.809
60	19.1	23.7	21.6	21.7	26.4	28.6	0.861	0.807	0.848	0.846	0.771	0.747
90	24.4	29.8	27.4	28.1	32.5	34.3	0.793	0.734	0.759	0.751	0.704	0.684
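The EER columns above report the equal error rate, the operating point where the false acceptance rate and false rejection rate coincide. The sketch below shows one simple way such a value could be estimated from distance scores of consistent and inconsistent audio-video pairs. It assumes that a lower score means more consistent, sweeps only the observed scores as thresholds, and averages the two rates at the closest crossing; the function name and these choices are illustrative assumptions, not the paper's evaluation protocol.

```python
def equal_error_rate(consistent_scores, inconsistent_scores):
    """Estimate the EER from distance scores (lower = more consistent).

    A pair is accepted as consistent when its score is <= the threshold.
    Returns the average of FAR and FRR at the threshold where they are
    closest, as a fraction in [0, 1].
    """
    if not consistent_scores or not inconsistent_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(consistent_scores) | set(inconsistent_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: a consistent pair scored above the threshold.
        frr = sum(s > t for s in consistent_scores) / len(consistent_scores)
        # False acceptance: an inconsistent pair scored at or below it.
        far = sum(s <= t for s in inconsistent_scores) / len(inconsistent_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the EER/% columns.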
