Quantum-Inspired Optimal Fractional-Order Spectrogram for Speech Recognition Enhancement
1. School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, Fujian, China;
2. Zhicheng College, Fuzhou University, Fuzhou 350002, Fujian, China;
3. School of Music, Fujian Normal University, Fuzhou 350108, Fujian, China
Online published: 2025-12-22
Sun Lei, Zhang Xianheng, Liao Yipeng, et al. Quantum-Inspired Optimal Fractional-Order Spectrogram for Speech Recognition Enhancement[J]. Journal of South China University of Technology (Natural Science Edition), 0: 1. DOI: 10.12141/j.issn.1000-565X.250324
Conventional spectrograms offer insufficient resolution and weak feature discriminability when representing speech signals in the time-frequency domain. To address this, the paper proposes a method for generating an optimal fractional-order spectrogram based on a quantum-inspired Newton-Raphson algorithm, with the aim of improving speech recognition and classification performance. First, quantum encoding is used to initialize the population of the Newton-Raphson algorithm. Quantum rotation gates steer individuals toward the current best solution, quantum mutation and catastrophe operations restore population diversity and prevent premature convergence, and simulated annealing and Lévy flight strategies strengthen the global search capability. Next, the audio signal is windowed and framed, a fractional-order spectrogram is generated via the fractional Fourier transform, and Mel filters compress its scale to yield a fractional-order Mel spectrogram. Finally, a tunable fractional-order parameter α is introduced to give the time-frequency representation an additional degree of freedom. With minimization of information entropy as the objective function, the quantum-inspired Newton-Raphson algorithm adaptively optimizes the hyperparameters, including α, frame length, and frame shift, to obtain the optimal fractional-order spectrogram. Experiments are conducted on the CEC 2022 benchmark functions, on the public RAVDESS emotion recognition and UrbanSound8K sound classification datasets, and on a self-built dataset of sung vowel pronunciation. The results show that the quantum-inspired Newton-Raphson algorithm has stronger global optimization capability than existing optimization algorithms and is more stable on high-dimensional complex problems.
The resulting optimal fractional-order spectrogram effectively concentrates signal energy and enhances feature separability, and it significantly outperforms traditional speech feature extraction methods in accuracy, recall, and F1-score for speech recognition. The method offers a new approach to high-precision feature extraction from complex speech signals, effectively improves speech recognition performance, and shows good robustness and application prospects.
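The pipeline described in the abstract (window and frame the signal, apply a fractional-order transform per frame, then score the result by information entropy) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the fractional transform is a simple weighted-sum fractional power of the unitary DFT (built from its four integer powers), which may differ from the discretization the authors use; the order `a` is in units where `a = 1` is the ordinary DFT (that is, α = aπ/2); the Mel filtering step is omitted; and a plain grid search stands in for the quantum-inspired Newton-Raphson optimizer. All function names (`frft`, `frac_spectrogram`, `shannon_entropy`, `best_order`) are hypothetical.

```python
import numpy as np

def frft(x, a):
    """Apply the fractional power F**a of the unitary DFT along the last axis.

    Assumption: F**a is defined through the four spectral projectors of F
    (F**4 = I), so it is a weighted sum of I, F, F**2 and F**3.
    """
    x = np.asarray(x, dtype=complex)
    # Integer powers of the unitary DFT: identity, DFT, index reversal, inverse DFT.
    powers = [
        x,
        np.fft.fft(x, axis=-1, norm="ortho"),
        np.roll(x[..., ::-1], 1, axis=-1),
        np.fft.ifft(x, axis=-1, norm="ortho"),
    ]
    m = np.arange(4)
    out = np.zeros_like(x)
    for k, fk in enumerate(powers):
        # Weight of F**k in the expansion of F**a.
        c_k = np.exp(1j * np.pi * m * (k - a) / 2).sum() / 4
        out += c_k * fk
    return out

def frac_spectrogram(signal, a, frame_len, hop):
    """Hann-windowed frames -> squared magnitude of the order-a transform."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    return np.abs(frft(frames, a)) ** 2

def shannon_entropy(spec):
    """Shannon entropy (in nats) of the energy-normalized spectrogram."""
    p = spec / spec.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def best_order(signal, orders, frame_len=256, hop=128):
    """Grid-search stand-in for the optimizer: pick the order with minimum entropy."""
    scores = [shannon_entropy(frac_spectrogram(signal, a, frame_len, hop))
              for a in orders]
    return orders[int(np.argmin(scores))], scores
```

A lower entropy means the energy is concentrated in fewer time-frequency cells, which is the sense in which the abstract speaks of focusing signal energy. In this sketch only the order is searched; frame length and frame shift could be added to the grid in the same way.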