Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (9): 1-10. doi: 10.12141/j.issn.1000-565X.250134

• Computer Science and Technology •

CODS: An Audio-Text Aligned Dataset for Cantonese Opera Vocal Synthesis

LI Yue1, HUANG Yihan1, PENG Zhengwei2, XIE Jixuan1, DU Yuye1   

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  2. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, Guangdong, China
  • Received: 2025-05-06 Published: 2025-09-25 Online: 2025-05-20
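  • About the author: LI Yue (b. 1974), female, Ph.D., associate professor; her research focuses on artificial intelligence, data mining, and computer science popularization. E-mail: liyue@scut.edu.cn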
  • Supported by:
    the National Natural Science Foundation of China (62476096)

Abstract:

Chinese opera, one of China's traditional arts, has a distinctive musical expressiveness. Cantonese opera is one of the major Chinese opera genres and an important carrier of Lingnan culture, and it is inscribed on the UNESCO Representative List of the Intangible Cultural Heritage of Humanity. In recent years, generative artificial intelligence has shown strong capabilities in content creation. Singing voice synthesis, for example, can generate natural-sounding singing from a given musical score, which offers a new approach to the digital preservation and innovation of Cantonese opera. However, collecting and organizing opera data is hampered by poor audio quality and complex dialect annotation, so high-quality opera datasets are extremely scarce. To address this, this paper applies singing voice synthesis techniques from pop music to Cantonese opera and proposes CODS, the first Cantonese opera vocal synthesis dataset with phoneme-level annotation and audio-text alignment. First, CODS is constructed through a systematic pipeline. It is drawn from 29 original works by four renowned performers, has a total duration of 3.81 hours, and provides important support for research on and digitization of Cantonese opera. Second, experiments with a deep learning-based method on this dataset achieve Cantonese opera vocal synthesis with controllable lyrics, timbre, and melody. Finally, an evaluation scheme for Cantonese opera vocal synthesis is established. The objective and subjective results reach a good level for the field, which validates the usability of the dataset. CODS fills a gap in the application of artificial intelligence to Cantonese opera vocal synthesis and supports the inheritance and innovation of this traditional art.

Key words: Cantonese opera, generative artificial intelligence, dataset, vocal synthesis

CLC number: