CODS: An Audio-Text Aligned Dataset for Cantonese Opera Vocal Synthesis

doi:10.12141/j.issn.1000-565X.250134

Abstract

Chinese opera, one of China's traditional arts, has a distinctive musical expressiveness. Cantonese opera, a major Chinese opera genre and an important carrier of Lingnan culture, has been inscribed on the World Intangible Cultural Heritage List. In recent years, generative artificial intelligence has shown strong capabilities in content creation; singing voice synthesis, for example, can produce natural-sounding singing from a specified musical score. This opens a new avenue for the digital preservation and innovation of Cantonese opera. However, collecting and organizing opera data is hampered by poor audio quality and complex dialect annotation, leaving high-quality opera datasets in extremely short supply. To address this, the paper applies singing synthesis technology from pop music to Cantonese opera vocal synthesis and proposes the first Cantonese opera vocal synthesis dataset with phoneme-level annotation and audio-text alignment. First, the CODS dataset is constructed through a systematic process. It is drawn from 29 original works by four famous performers, with a total length of 3.81 hours, and provides important support for the research and digitization of Cantonese opera. Second, using this dataset, experiments are conducted with a deep learning-based method for Cantonese opera voice synthesis, achieving controllable generation in terms of lyrics, timbre, and melody. Finally, a comprehensive evaluation framework for Cantonese opera synthesis is established. Both objective and subjective evaluations reach a satisfactory level within the domain, further validating the usability of the proposed dataset. The CODS dataset fills a gap in artificial intelligence research on Cantonese opera vocal synthesis and promotes the inheritance and innovation of this traditional art.

Key words: Cantonese opera, generative artificial intelligence, dataset, voice synthesis

CLC Number:

TP39

LI Yue, HUANG Yihan, PENG Zhengwei, XIE Jixuan, DU Yuye. CODS: An Audio-Text Aligned Dataset for Cantonese Opera Vocal Synthesis [J]. Journal of South China University of Technology (Natural Science Edition), 2025, 53(9): 1-10.

Figures/Tables 15

Table 5

Objective evaluation results

Model	$M$ /dB	$R_f$ /Hz	$C_f$	$H$
FT-GAN	5.97	36.97	0.813	17.74
FastSpeech 2	6.07	41.44	0.788	17.82
DiffSinger	6.94	38.24	0.795	17.33
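
The column symbols in this table are not defined in this excerpt. Assuming that $R_f$ is a root-mean-square error on the F0 contour (in Hz) and $C_f$ a Pearson correlation between reference and synthesized F0, the sketch below shows how such frame-level metrics are commonly computed. The function names, the voiced-frame rule (F0 > 0 in both contours), and the sample contours are all assumptions for illustration, not the authors' evaluation code.

```python
import math

def voiced_pairs(f0_ref, f0_syn):
    """Keep only frames where both contours are voiced (F0 > 0). Assumed rule."""
    return [(r, s) for r, s in zip(f0_ref, f0_syn) if r > 0 and s > 0]

def f0_rmse(f0_ref, f0_syn):
    """Root-mean-square F0 error in Hz over jointly voiced frames."""
    pairs = voiced_pairs(f0_ref, f0_syn)
    return math.sqrt(sum((r - s) ** 2 for r, s in pairs) / len(pairs))

def f0_corr(f0_ref, f0_syn):
    """Pearson correlation of F0 over jointly voiced frames."""
    pairs = voiced_pairs(f0_ref, f0_syn)
    n = len(pairs)
    mean_r = sum(r for r, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_s = sum((s - mean_s) ** 2 for _, s in pairs)
    return cov / math.sqrt(var_r * var_s)

# Hypothetical contours in Hz; 0.0 marks an unvoiced frame.
ref = [220.0, 0.0, 230.0, 240.0, 250.0]
syn = [223.0, 225.0, 226.0, 240.0, 0.0]
print(f0_rmse(ref, syn), f0_corr(ref, syn))
```

Under these assumptions, a lower RMSE and a correlation closer to 1 both indicate a synthesized pitch contour that tracks the reference more closely.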

References 36

[1]	XU J. The language features and cultural implication of Cantonese opera librettos [J]. Frontiers in Art Research, 2021, 3(2): 20-29.
[2]	SHAN Yunming, DU Jinfeng. The promotion issues and the modernization model of the regional intangible cultural heritage: a case study of Cantonese opera [J]. Academic Research, 2024(8): 54-60. (in Chinese)
[3]	ZHANG L, LI R, WANG S, et al. M4Singer: a multi-style, multi-singer and musical score provided Mandarin singing corpus [C] // Proceedings of Advances in Neural Information Processing Systems. New Orleans: MIT Press, 2022: 6914-6926.
[4]	ZHANG Z, ZHENG Y, LI X, et al. WeSinger 2: fully parallel singing voice synthesis via multi-singer conditional adversarial training [C] // Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
[5]	WANG C, ZENG C, HE X. XiaoiceSing 2: a high-fidelity singing voice synthesizer based on generative adversarial network [EB/OL]. (2022-10-26) [2025-05-05].
[6]	LIU X, ZHANG W, ZHENG Z, et al. FGP-GAN: fine-grained perception integrated generative adversarial network for expressive Mandarin singing voice synthesis [J]. IEEE Transactions on Consumer Electronics, 2024, 70(3): 6054-6063.
[7]	CUI J, GU Y, WENG C, et al. SiFiSinger: a high-fidelity end-to-end singing voice synthesizer based on source-filter model [C] // Proceedings of 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Seoul: IEEE, 2024: 11126-11130.
[8]	LIU J, LI C, REN Y, et al. DiffSinger: singing voice synthesis via shallow diffusion mechanism [C] // Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2022: 11020-11028.
[9]	REPETTO R C, SERRA X. A collection of music scores for corpus based jingju singing research [C] // Proceedings of the 18th International Society for Music Information Retrieval Conference. Suzhou: ISMIR, 2017: 46-52.
[10]	ZHENG M, BAI P, SHI X, et al. FT-GAN: fine-grained tune modeling for Chinese opera synthesis [C] // Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 19697-19705.
[11]	BOŽIĆ M, HORVAT M. A survey of deep learning audio generation methods [EB/OL]. (2024-05-31) [2025-05-05].
[12]	KIM J, KONG J, SON J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech [C] // Proceedings of the 38th International Conference on Machine Learning. Vienna: ML Research Press, 2021: 5530-5540.
[13]	ZHANG Y, CONG J, XUE H, et al. VISinger: variational inference with adversarial learning for end-to-end singing voice synthesis [C] // Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 7237-7241.
[14]	ZHANG Y, XUE H, LI H, et al. VISinger 2: high-fidelity end-to-end singing voice synthesis enhanced by digital signal processing synthesizer [EB/OL]. (2022-11-05) [2025-05-05].
[15]	HWANG J S, LEE S H, LEE S W. HiddenSinger: high-quality singing voice synthesis via neural audio codec and latent diffusion models [J]. Neural Networks, 2025, 181: 106762/1-10.
[16]	ZHANG Y, HUANG R, LI R, et al. StyleSinger: style transfer for out-of-domain singing voice synthesis [C] // Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 19597-19605.
[17]	BLACK D A, LI M, TIAN M. Automatic identification of emotional cues in Chinese opera singing [C] // Proceedings of the 13th International Conference on Music Perception and Cognition. Seoul: [s.n.], 2014: 250-255.
[18]	ISLAM R, XU M, FAN Y. Chinese traditional opera database for music genre recognition [C] // Proceedings of the 18th Oriental COCOSDA/CASLRE. Shanghai: IEEE, 2015: 38-41.
[19]	LI Y, PENG Z, XU D, et al. RoleNet: a multiple features fusion network for role classification in Cantonese opera [J]. Multimedia Tools and Applications, (2025-01-28).
[20]	CHEN Q, ZHAO W, WANG Q, et al. The sustainable development of intangible cultural heritage with AI: Cantonese opera singing genre classification based on CoGCNet model in China [J]. Sustainability, 2022, 14: 2923/1-20.
[21]	LI Q, HU B. Joint time and frequency transformer for Chinese opera classification [C] // Proceedings of Interspeech 2023. Dublin: ISCA, 2023: 3919-3923.
[22]	WU Y, LI S, YU C, et al. Peking Opera synthesis via duration informed attention network [EB/OL]. (2020-08-07) [2025-05-05].
[23]	BAI P, ZHOU Y, ZHENG M, et al. Improving Chinese pop song and Hokkien Gezi Opera singing voice synthesis by enhancing local modeling [C] // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: ACL, 2023: 3302-3312.
[24]	ZHOU X, SUN W, SHI X. A high-quality melody-aware Peking Opera synthesizer using data augmentation [C] // Proceedings of 2023 IEEE International Conference on Multimedia and Expo. Brisbane: IEEE, 2023: 1092-1097.
[25]	REN Y, HU C, TAN X, et al. FastSpeech 2: fast and high-quality end-to-end text to speech [EB/OL]. (2020-06-08) [2025-05-05].
[26]	MCAULIFFE M, SOCOLOF M, MIHUC S, et al. Montreal forced aligner: trainable text-speech alignment using Kaldi [C] // Proceedings of Interspeech 2017. Stockholm: ISCA, 2017: 498-502.
[27]	SOLOVYEV R, STEMPKOVSKIY A, HABRUSEVA T. Benchmarks and leaderboards for sound demixing tasks [EB/OL]. (2024-05-07) [2025-05-05].
[28]	BOERSMA P. Praat: doing phonetics by computer [CP/OL]. (2011-05-01) [2025-05-05].
[29]	DAI S, WU Y, CHEN S, et al. SingStyle111: a multilingual singing dataset with style transfer [C] // Proceedings of the 24th International Society for Music Information Retrieval Conference. Milan: ISMIR, 2023: 765-773.
[30]	JADOUL Y, THOMPSON B, DE BOER B. Introducing Parselmouth: a Python interface to Praat [J]. Journal of Phonetics, 2018, 71: 1-15.
[31]	LI R, ZHANG Y, WANG Y, et al. Robust singing voice transcription serves synthesis [EB/OL]. (2024-05-16) [2025-05-05].
[32]	GUAN Ziyin, DENG Weisheng, ZHAO Ziming. 粤语审音配词字库 [Chinese character database with Cantonese pronunciations and word formations] [DB/OL]. (2003-01-12) [2025-05-05]. (in Chinese)
[33]	KONG J, KIM J, BAE J. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis [C] // Advances in Neural Information Processing Systems 33: 34th Conference on Neural Information Processing Systems. San Diego: Neural Information Processing Systems Foundation, Inc., 2020: 17022-17033.
[34]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C] // Proceedings of Advances in Neural Information Processing Systems. Long Beach: MIT Press, 2017: 6000-6010.
[35]	GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition [EB/OL]. (2020-05-16) [2025-05-05].
[36]	CHEN J, TAN X, LUAN J, et al. HiFiSinger: towards high-fidelity neural singing voice synthesis [EB/OL]. (2020-09-03) [2025-05-05].

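Dataset	Annotation	Total duration/h	Phoneme-level annotated duration/h
Peking opera, vocals only [9]	Phoneme	7.00	1.70
Hokkien Gezi opera [10]	Partial phoneme	4.54	4.54
5 Chinese and Western opera genres [17]	Genre	2.04
14 Chinese opera genres [18]	Genre	10.23
Cantonese opera, vocals only [19]	Role	2.50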

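Performer ID	Name	Gender	Vocal range	Duration/h
1	红线女 (Hong Xiannü)	Female	E3-C6 (52-84)	0.92
2	罗家宝 (Luo Jiabao)	Male	G2-C#5 (43-73)	0.95
3	陈笑风 (Chen Xiaofeng)	Male	A2-C#5 (45-73)	1.01
4	文千岁 (Wen Qiansui)	Male	G2-D5 (43-74)	0.93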
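
The parenthesized numbers in the vocal-range column are consistent with MIDI note numbers under the common convention C4 = 60. This is an inference from the table values, not something the excerpt states. A minimal sketch of that conversion, using a hypothetical helper `note_to_midi`, also checks that the four per-performer durations add up to the 3.81 hours quoted in the abstract:

```python
# Semitone offsets of the natural notes within one octave.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_midi(name):
    """Convert a note name such as 'E3' or 'C#5' to a MIDI number (C4 = 60)."""
    letter, rest = name[0].upper(), name[1:]
    offset = SEMITONES[letter]
    if rest.startswith("#"):      # sharp raises by one semitone
        offset, rest = offset + 1, rest[1:]
    elif rest.startswith("b"):    # flat lowers by one semitone
        offset, rest = offset - 1, rest[1:]
    octave = int(rest)
    return 12 * (octave + 1) + offset

# Vocal ranges and durations copied from the table above.
ranges = {1: ("E3", "C6"), 2: ("G2", "C#5"), 3: ("A2", "C#5"), 4: ("G2", "D5")}
durations = [0.92, 0.95, 1.01, 0.93]

for performer_id, (low, high) in ranges.items():
    print(performer_id, note_to_midi(low), note_to_midi(high))

total_hours = round(sum(durations), 2)
print(total_hours)
```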

Phoneme	Count	Phoneme	Count	Phoneme	Count	Phoneme	Count
s	1 514	k	136	ak	65	ong	337
m	973	kw	24	e	1 365	ok	141
j	1 367	aa	4 782	ei	514	u	1 623
c	874	aai	151	eng	36	ui	101
l	880	aau	54	i	4 649	un	104
z	920	aam	69	iu	398	ung	498
h	758	aan	356	im	85	uk	194
f	489	aang	101	in	456	oe	758
g	756	aak	77	ing	599	oeng	446
w	386	ai	389	ip	34	oek	50
n	446	au	530	it	80	ng	2 799
t	402	am	336	ik	106	eoi	368
b	544	an	552	o	3 244	eon	82
d	476	ang	133	oi	404	yu	977
gw	159	ap	57	ou	407	yun	233
p	147	at	355	on	66	yut	98
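
The symbols in this table resemble Jyutping initials and finals; that is an observation about the table, not a statement from this excerpt. Under that assumption, a frequency table of this kind could be tallied roughly as follows. The `split_syllable` helper, the initial inventory, and the sample syllables are hypothetical, and the authors' actual annotation pipeline may differ.

```python
from collections import Counter

# Assumed Jyutping initials, with digraphs first so that 'gw', 'kw' and 'ng'
# are matched before the single letters 'g', 'k' and 'n'.
INITIALS = ["gw", "kw", "ng", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "w", "z", "c", "s", "j"]

def split_syllable(syllable):
    """Split one Jyutping syllable into [initial, final], dropping the tone digit."""
    base = syllable.rstrip("0123456789").lower()
    for initial in INITIALS:
        if base.startswith(initial):
            rest = base[len(initial):]
            # A syllabic nasal such as 'ng' or 'm' has no final: keep it whole.
            return [initial, rest] if rest else [base]
    return [base]  # syllable with no initial, e.g. 'aa'

def phoneme_counts(syllables):
    """Count every initial and final across a sequence of syllables."""
    return Counter(p for s in syllables for p in split_syllable(s))

sample = "ngo5 gwong2 jyut6 ng5 ngo5".split()  # hypothetical transcription
counts = phoneme_counts(sample)
print(counts.most_common())
```

In this sketch the initial `ng` and the syllabic `ng` share one key, which would be one way to end up with a single `ng` row as in the table.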

Model	Pronunciation accuracy	Prosodic naturalness	Emotional expressiveness	Synthesis purity
GT	4.52 ± 0.03	4.38 ± 0.04	4.42 ± 0.03	4.44 ± 0.03
FT-GAN	4.13 ± 0.03	4.12 ± 0.03	4.16 ± 0.04	4.10 ± 0.03
FastSpeech 2	4.06 ± 0.03	3.86 ± 0.03	3.79 ± 0.04	4.14 ± 0.03
DiffSinger	3.98 ± 0.03	3.91 ± 0.04	3.75 ± 0.03	3.85 ± 0.03
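
The "±" values above are presumably confidence half-widths around mean opinion scores, although this excerpt does not state the confidence level, the rating scale, or the number of listeners. A minimal sketch of one common convention, a normal-approximation 95% interval, is shown below; the 1.96 factor, the helper name `mos_with_ci`, and the sample ratings are assumptions for illustration.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width) of a normal-approximation confidence interval."""
    mean = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem

ratings = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]  # hypothetical 1-5 listener scores
mean, half_width = mos_with_ci(ratings)
print(f"{mean:.2f} ± {half_width:.2f}")
```

With only a handful of listeners a Student's t quantile would be the more careful choice than 1.96; the normal approximation is used here only to keep the sketch short.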