CODS: An Audio-Text Aligned Dataset for Cantonese Opera Vocal Synthesis

doi:10.12141/j.issn.1000-565X.250134

Abstract

Chinese opera, one of China's traditional arts, has a distinctive musical expressiveness. Cantonese opera is one of the major Chinese opera genres and an important carrier of Lingnan culture, and it has been inscribed on the World Intangible Cultural Heritage List. In recent years, generative artificial intelligence has shown strong capabilities in content creation; singing voice synthesis, for example, can produce natural singing from a specified musical score, which offers a new approach to the digital preservation and innovation of Cantonese opera. However, collecting and organizing opera data is hampered by poor audio quality and complex dialect annotation, so high-quality opera datasets are extremely scarce. To address this, this paper applies singing voice synthesis techniques from pop music to Cantonese opera vocal synthesis and proposes CODS, the first Cantonese opera vocal synthesis dataset with phoneme-level annotation and audio-text alignment. First, CODS is constructed through a systematic pipeline; it is drawn from 29 original works by four famous performers, with a total duration of 3.81 h, and provides important support for Cantonese opera research and digitization. Second, deep learning methods are applied to this dataset to achieve Cantonese opera vocal synthesis with controllable lyrics, timbre, and melody. Finally, an evaluation scheme for Cantonese opera vocal synthesis is established; the objective and subjective results reach a good level for the field, which validates the usability of the dataset. CODS fills a gap in the application of artificial intelligence to Cantonese opera vocal synthesis and supports the inheritance and innovation of this traditional art.

Key words: Cantonese opera, generative artificial intelligence, dataset, voice synthesis

CLC number:

TP39

LI Yue, HUANG Yihan, PENG Zhengwei, XIE Jixuan, DU Yuye. CODS: An Audio-Text Aligned Dataset for Cantonese Opera Vocal Synthesis[J]. Journal of South China University of Technology (Natural Science Edition), 2025, 53(9): 1-10.

Figures and tables: 15

Table 5: Objective evaluation results

Model	$M$/dB	$R_f$/Hz	$C_f$	$H$
FT-GAN	5.97	36.97	0.813	17.74
FastSpeech 2	6.07	41.44	0.788	17.82
DiffSinger	6.94	38.24	0.795	17.33

References (36)

[1]	XU J. The language features and cultural implication of Cantonese opera librettos[J]. Frontiers in Art Research, 2021, 3(2): 20-29.
[2]	SHAN Yunming, DU Jinfeng. The promotion issues and the modernization model of the regional intangible cultural heritage: a case study of Cantonese opera[J]. Academic Research, 2024(8): 54-60. (in Chinese)
[3]	ZHANG L, LI R, WANG S, et al. M4Singer: a multi-style, multi-singer and musical score provided Mandarin singing corpus[C]// Proceedings of Advances in Neural Information Processing Systems. New Orleans: MIT Press, 2022: 6914-6926.
[4]	ZHANG Z, ZHENG Y, LI X, et al. WeSinger 2: fully parallel singing voice synthesis via multi-singer conditional adversarial training[C]// Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
[5]	WANG C, ZENG C, HE X. Xiaoicesing 2: a high-fidelity singing voice synthesizer based on generative adversarial network[EB/OL]. (2022-10-26)[2025-05-05].
[6]	LIU X, ZHANG W, ZHENG Z, et al. FGP-GAN: fine-grained perception integrated generative adversarial network for expressive Mandarin singing voice synthesis[J]. IEEE Transactions on Consumer Electronics, 2024, 70(3): 6054-6063.
[7]	CUI J, GU Y, WENG C, et al. Sifisinger: a high-fidelity end-to-end singing voice synthesizer based on source-filter model[C]// Proceedings of 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Seoul: IEEE, 2024: 11126-11130.
[8]	LIU J, LI C, REN Y, et al. DiffSinger: singing voice synthesis via shallow diffusion mechanism[C]// Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2022: 11020-11028.
[9]	REPETTO R C, SERRA X. A collection of music scores for corpus based jingju singing research[C]// Proceedings of the 18th International Society for Music Information Retrieval Conference. Suzhou: ISMIR, 2017: 46-52.
[10]	ZHENG M, BAI P, SHI X, et al. FT-GAN: fine-grained tune modeling for Chinese opera synthesis[C]// Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 19697-19705.
[11]	BOŽIĆ M, HORVAT M. A survey of deep learning audio generation methods[EB/OL]. (2024-05-31)[2025-05-05].
[12]	KIM J, KONG J, SON J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech[C]// Proceedings of the 38th International Conference on Machine Learning. Vienna: ML Research Press, 2021: 5530-5540.
[13]	ZHANG Y, CONG J, XUE H, et al. VISinger: variational inference with adversarial learning for end-to-end singing voice synthesis[C]// Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 7237-7241.
[14]	ZHANG Y, XUE H, LI H, et al. VISinger 2: high-fidelity end-to-end singing voice synthesis enhanced by digital signal processing synthesizer[EB/OL]. (2022-11-05)[2025-05-05].
[15]	HWANG J S, LEE S H, LEE S W. HiddenSinger: high-quality singing voice synthesis via neural audio codec and latent diffusion models[J]. Neural Networks, 2025, 181: 106762/1-10.
[16]	ZHANG Y, HUANG R, LI R, et al. StyleSinger: style transfer for out-of-domain singing voice synthesis[C]// Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 19597-19605.
[17]	BLACK D A, LI M, TIAN M. Automatic identification of emotional cues in Chinese opera singing[C]// Proceedings of the 13th International Conference on Music Perception and Cognition. Seoul: [s.n.], 2014: 250-255.
[18]	ISLAM R, XU M, FAN Y. Chinese traditional opera database for music genre recognition[C]// Proceedings of the 18th Oriental COCOSDA/CASLRE. Shanghai: IEEE, 2015: 38-41.
[19]	LI Y, PENG Z, XU D, et al. RoleNet: a multiple features fusion network for role classification in Cantonese opera[J]. Multimedia Tools and Applications, (2025-01-28).
[20]	CHEN Q, ZHAO W, WANG Q, et al. The sustainable development of intangible cultural heritage with AI: Cantonese opera singing genre classification based on CoGCNet model in China[J]. Sustainability, 2022, 14: 2923/1-20.
[21]	LI Q, HU B. Joint time and frequency transformer for Chinese opera classification[C]// Proceedings of Interspeech 2023. Dublin: ISCA, 2023: 3919-3923.
[22]	WU Y, LI S, YU C, et al. Peking Opera synthesis via duration informed attention network[EB/OL]. (2020-08-07)[2025-05-05].
[23]	BAI P, ZHOU Y, ZHENG M, et al. Improving Chinese pop song and Hokkien Gezi Opera singing voice synthesis by enhancing local modeling[C]// Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: ACL, 2023: 3302-3312.
[24]	ZHOU X, SUN W, SHI X. A high-quality melody-aware Peking Opera synthesizer using data augmentation[C]// Proceedings of 2023 IEEE International Conference on Multimedia and Expo. Brisbane: IEEE, 2023: 1092-1097.
[25]	REN Y, HU C, TAN X, et al. FastSpeech 2: fast and high-quality end-to-end text to speech[EB/OL]. (2020-06-08)[2025-05-05].
[26]	MCAULIFFE M, SOCOLOF M, MIHUC S, et al. Montreal forced aligner: trainable text-speech alignment using Kaldi[C]// Proceedings of Interspeech 2017. Stockholm: ISCA, 2017: 498-502.
[27]	SOLOVYEY R, STEMPKOVSKIY A, HABRUSEVA T. Benchmarks and leaderboards for sound demixing tasks[EB/OL]. (2024-05-07)[2025-05-05].
[28]	BOERSMA P. Praat: doing phonetics by computer[CP/OL]. (2011-05-01)[2025-05-05].
[29]	DAI S, WU Y, CHEN S, et al. SingStyle111: a multilingual singing dataset with style transfer[C]// Proceedings of the 24th International Society for Music Information Retrieval Conference. Milan: ISMIR, 2023: 765-773.
[30]	JADOUL Y, THOMPSON B, DE BOER B. Introducing Parselmouth: a Python interface to Praat[J]. Journal of Phonetics, 2018, 71: 1-15.
[31]	LI R, ZHANG Y, WANG Y, et al. Robust singing voice transcription serves synthesis[EB/OL]. (2024-05-16)[2025-05-05].
[32]	关子尹, 邓伟生, 赵子明. 粤语审音配词字库 (Chinese character database with Cantonese pronunciations and word formations)[DB/OL]. (2003-01-12)[2025-05-05].
[33]	KONG J, KIM J, BAE J. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis[C]// Advances in Neural Information Processing Systems 33: 34th Conference on Neural Information Processing Systems. San Diego: Neural Information Processing Systems Foundation, Inc., 2020: 17022-17033.
[34]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of Advances in Neural Information Processing Systems. Long Beach: MIT Press, 2017: 6000-6010.
[35]	GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition[EB/OL]. (2020-05-16)[2025-05-05].
[36]	CHEN J, TAN X, LUAN J, et al. Hifisinger: towards high-fidelity neural singing voice synthesis[EB/OL]. (2020-09-03)[2025-05-05].

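Dataset	Annotation	Total duration/h	Phoneme-level annotated duration/h
Peking opera, vocals only [9]	Phoneme	7.00	1.70
Hokkien Gezi opera [10]	Partial phoneme	4.54	4.54
5 Chinese and Western opera genres [17]	Genre	2.04
14 Chinese opera genres [18]	Genre	10.23
Cantonese opera, vocals only [19]	Role	2.50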

Singer ID	Name	Gender	Vocal range	Duration/h
1	红线女	Female	E3-C6 (52-84)	0.92
2	罗家宝	Male	G2-C#5 (43-73)	0.95
3	陈笑风	Male	A2-C#5 (45-73)	1.01
4	文千岁	Male	G2-D5 (43-74)	0.93
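
The numbers in parentheses in the vocal range column are consistent with MIDI note numbers under the common convention C4 = 60; the page itself does not label them, so this reading is an assumption. A minimal sketch that checks the conversion, with `note_to_midi` as an illustrative helper name rather than anything from the paper:

```python
import re

# Semitone offsets of the natural notes within one octave (C = 0).
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
ACCIDENTALS = {"": 0, "#": 1, "b": -1}


def note_to_midi(name: str) -> int:
    """Convert scientific pitch notation (e.g. 'C#5') to a MIDI note number.

    Assumes the convention C4 = 60, i.e. midi = 12 * (octave + 1) + semitone.
    """
    match = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized note name: {name!r}")
    letter, accidental, octave = match.groups()
    semitone = NOTE_OFFSETS[letter] + ACCIDENTALS[accidental]
    return 12 * (int(octave) + 1) + semitone


# Vocal ranges copied from the singer table, keyed by singer ID.
ranges = {1: ("E3", "C6"), 2: ("G2", "C#5"), 3: ("A2", "C#5"), 4: ("G2", "D5")}
for singer_id, (low, high) in ranges.items():
    print(singer_id, note_to_midi(low), note_to_midi(high))
```

Under this convention each pair reproduces the parenthesized values in the table, and the four durations (0.92 + 0.95 + 1.01 + 0.93) sum to the 3.81 h stated in the abstract.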

Phoneme	Frequency	Phoneme	Frequency	Phoneme	Frequency	Phoneme	Frequency
s	1 514	k	136	ak	65	ong	337
m	973	kw	24	e	1 365	ok	141
j	1 367	aa	4 782	ei	514	u	1 623
c	874	aai	151	eng	36	ui	101
l	880	aau	54	i	4 649	un	104
z	920	aam	69	iu	398	ung	498
h	758	aan	356	im	85	uk	194
f	489	aang	101	in	456	oe	758
g	756	aak	77	ing	599	oeng	446
w	386	ai	389	ip	34	oek	50
n	446	au	530	it	80	ng	2 799
t	402	am	336	ik	106	eoi	368
b	544	an	552	o	3 244	eon	82
d	476	ang	133	oi	404	yu	977
gw	159	ap	57	ou	407	yun	233
p	147	at	355	on	66	yut	98
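
The phoneme inventory above appears to follow Jyutping romanization (initials such as gw, kw, ng; finals such as aa, eoi, yut). As a rough sketch of what phoneme-level units look like for a single syllable, the following splits a Jyutping syllable into initial, final, and tone. It assumes Jyutping input with an optional trailing tone digit; `split_jyutping` and the example syllables are illustrative and are not taken from the paper's annotation pipeline.

```python
# Jyutping initials; two-letter initials are listed first so that
# "gw", "kw" and "ng" are matched before "g", "k" and "n".
INITIALS = ["gw", "kw", "ng", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "w", "z", "c", "s", "j"]

# Syllabic nasals form a whole syllable with no initial.
SYLLABIC_NASALS = {"m", "ng"}


def split_jyutping(syllable: str):
    """Split a Jyutping syllable into (initial, final, tone).

    The tone is the trailing digit if present, otherwise an empty string.
    Hypothetical helper for illustration only.
    """
    tone = ""
    if syllable and syllable[-1].isdigit():
        syllable, tone = syllable[:-1], syllable[-1]
    if syllable in SYLLABIC_NASALS:
        return "", syllable, tone
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):], tone
    # No initial matched: the syllable starts with a vowel.
    return "", syllable, tone


for example in ["jyut6", "gwong2", "ngo5", "ng5", "aa3"]:
    print(example, split_jyutping(example))
```

Counting the units returned by such a split over a whole corpus would give a frequency table of the same shape as the one above, though the actual counting procedure used for CODS is not described on this page.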

Model	Pronunciation accuracy	Prosodic naturalness	Emotional expressiveness	Synthesis purity
GT	4.52 ± 0.03	4.38 ± 0.04	4.42 ± 0.03	4.44 ± 0.03
FT-GAN	4.13 ± 0.03	4.12 ± 0.03	4.16 ± 0.04	4.10 ± 0.03
FastSpeech 2	4.06 ± 0.03	3.86 ± 0.03	3.79 ± 0.04	4.14 ± 0.03
DiffSinger	3.98 ± 0.03	3.91 ± 0.04	3.75 ± 0.03	3.85 ± 0.03
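
The subjective scores above are reported as a value plus or minus a margin. This page does not say how the margin was computed; one common convention for listening tests is a mean opinion score with a 95% confidence interval under a normal approximation. The sketch below assumes that convention, and the ratings in it are made up for illustration, not data from the paper.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (mean, half_width) for a list of listener ratings.

    half_width is z times the standard error of the mean; z = 1.96
    corresponds to a 95% interval under a normal approximation.
    This is an assumed convention, not the paper's stated method.
    """
    if len(scores) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, z * standard_error


# Hypothetical ratings on a 1-5 scale for one system and one criterion.
ratings = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
mean, half_width = mos_with_ci(ratings)
print(f"{mean:.2f} ± {half_width:.2f}")
```

With more ratings per system the half-width shrinks roughly as one over the square root of the number of ratings, which is one way margins as small as those in the table can arise.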