Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (7): 1-. doi: 10.12141/j.issn.1000-565X.240508

• Electronics, Communication and Automatic Control •

Sound Event Classification Method Based on Reducing High-Frequency Reverberation and RF-DRSN-EMA

曹毅  王彦雯  李杰  郑植  孙浩   

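  1.  School of Mechanical Engineering, Jiangnan University/Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, Wuxi 214122, Jiangsu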

  • Publication date: 2025-07-25  Release date: 2025-01-17

Acoustic Scene Classification Method Based on Reducing High-Frequency Reverberation and RF-DRSN-EMA

CAO Yi   WANG Yanwen   LI Jie   ZHENG Zhi   SUN Hao   

  1. School of Mechanical Engineering, Jiangnan University/Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, Wuxi 214122, Jiangsu, China

  • Online:2025-07-25 Published:2025-01-17

Abstract (translated from Chinese):

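To address the low classification accuracy and weak generalization of existing sound event classification methods, a sound event classification method is proposed that combines high-frequency reverberation reduction with a frequency-domain residual shrinkage network with multi-scale attention (RF-DRSN-EMA). First, the principle of reverberation reduction is introduced and an algorithm for reducing high-frequency reverberation is proposed: it attenuates reverberation only in the separated high-frequency band of the audio and preserves the key frequency information in the remaining bands, improving speech clarity while keeping speech distortion as small as possible. Second, taking the deep residual shrinkage network as the base network and combining it with an improved frequency-domain self-calibration algorithm and a multi-scale attention module, the RF-DRSN-EMA network is proposed. The model uses an RF self-calibration block whose internal long- and short-distance residual structure alleviates feature collapse, with the aim of collecting frequency-domain information efficiently; a multi-scale attention module at the output of each unit further focuses on the useful information at the unit's output layer, strengthening the model's representation ability. Finally, sound event classification experiments were carried out on the ESC-10, UrbanSound8K, and DCASE2020 Task 1A datasets. The results show that the speech enhancement method based on high-frequency reverberation reduction specifically reduces the influence of background noise such as high-frequency reverberation and removes redundant features while causing little damage to sound quality, and therefore yields better classification performance. In addition, RF-DRSN-EMA achieves denoising of typical features and efficient information collection in the frequency domain of the network; the best classification accuracies reach 98.00%, 93.42%, and 72.80% on the three datasets, respectively, which verifies the effectiveness and generalization of the method.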

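Key words (translated from Chinese): sound event classification, high-frequency reverberation reduction, frequency-domain residual shrinkage network, multi-scale attention, speech enhancement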

Abstract:

To address the low classification accuracy and weak generalization ability of existing methods for acoustic scene classification, this paper proposes a method based on reducing high-frequency reverberation and a frequency-domain residual shrinkage network with multi-scale attention (RF-DRSN-EMA). First, the underlying principles of reverberation reduction are presented, along with an algorithm that specifically reduces high-frequency reverberation. The algorithm attenuates reverberation only in the separated high-frequency band while preserving essential frequency information in the other bands, so that speech intelligibility is enhanced and speech distortion is kept as small as possible. Second, based on the deep residual shrinkage network and combined with an improved frequency-domain self-calibration algorithm and a multi-scale attention module, the RF-DRSN-EMA network is proposed. The model uses an RF self-calibration block whose internal long-distance and short-distance residual structure alleviates feature collapse, with the aim of collecting frequency-domain information efficiently. A multi-scale attention module is applied at the output of each unit to further focus on the effective information at the unit's output layer, strengthening the representation ability of the model. Finally, experiments were carried out on the ESC-10, UrbanSound8K, and DCASE2020 Task 1A datasets. The results show that the speech enhancement method based on reducing high-frequency reverberation reduces the impact of background noise such as high-frequency reverberation and eliminates redundant features while causing little damage to sound quality, and therefore gives better classification performance. In addition, RF-DRSN-EMA achieves denoising of typical features and efficient information collection in the frequency domain of the network, and the best classification accuracies of the model reach 98.00%, 93.42%, and 72.80% on the three datasets, respectively, which verifies the effectiveness and generalization of the method.
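
The band-limited idea in the abstract (attenuate reverberation only in the high-frequency band, leave the other bands untouched) can be illustrated with a minimal sketch. The abstract does not give the authors' algorithm, so everything below is an assumption made for illustration: the function name `reduce_high_band_reverb`, the 4 kHz cutoff, and the use of a simple spectral-subtraction gain with a delayed, decayed copy of earlier frames as the late-reverberation estimate. Plain Python lists are used to keep the sketch dependency-free.

```python
def reduce_high_band_reverb(mag, sample_rate, cutoff_hz=4000.0,
                            delay_frames=3, decay=0.5, gain_floor=0.1):
    """Attenuate estimated late reverberation in high-frequency bins only.

    mag is a magnitude spectrogram as a list of frames, each a list of
    non-negative bin magnitudes spanning 0 Hz to sample_rate / 2.
    All parameter values are illustrative assumptions, not the paper's.
    """
    n_bins = len(mag[0])
    if n_bins < 2:
        raise ValueError("need at least two frequency bins")
    bin_hz = (sample_rate / 2.0) / (n_bins - 1)  # spacing between bin centres

    out = []
    for t, frame in enumerate(mag):
        new_frame = list(frame)  # low band is copied through unchanged
        if t >= delay_frames:
            past = mag[t - delay_frames]
            for k in range(n_bins):
                if k * bin_hz >= cutoff_hz and frame[k] > 0.0:
                    # Assumed late-reverb estimate: decayed copy of a past frame.
                    late = decay * past[k]
                    # Spectral-subtraction gain, floored to limit distortion.
                    gain = max(1.0 - late / frame[k], gain_floor)
                    new_frame[k] = frame[k] * gain
        out.append(new_frame)
    return out


# Usage: a flat spectrogram with 6 frames and 9 bins at 16 kHz (1 kHz per bin).
flat = [[1.0] * 9 for _ in range(6)]
processed = reduce_high_band_reverb(flat, sample_rate=16000)
```

In this sketch the bins below the cutoff are never modified, which mirrors the stated goal of preserving key frequency information outside the high band; the gain floor is one simple way to bound the distortion introduced in the processed band.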


Key words: acoustic scene classification, high-frequency reverberation reduction, frequency-domain residual shrinkage network, multi-scale attention, speech enhancement
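
For readers unfamiliar with the "residual shrinkage" in the network name: deep residual shrinkage networks are generally built around soft thresholding, which zeroes small-magnitude features and shrinks the rest toward zero. The sketch below shows that operation for one channel, with the threshold set to a fraction `alpha` of the channel's mean absolute value. It is a generic illustration, not the RF-DRSN-EMA implementation: in a trained network `alpha` would be produced by a small learned sub-network, whereas here it is passed in by hand, and the names `soft_threshold` and `shrink_channel` are hypothetical.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values within [-tau, tau] become 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def shrink_channel(values, alpha):
    """Soft-threshold one feature channel.

    The threshold is alpha * mean(|values|), with 0 < alpha < 1.
    alpha is supplied by hand here (an assumption for illustration);
    in a residual shrinkage network it would be learned.
    """
    mean_abs = sum(abs(v) for v in values) / len(values)
    tau = alpha * mean_abs
    return [soft_threshold(v, tau) for v in values]


# Usage: mean |v| = 2.25, so tau = 1.125 with alpha = 0.5.
features = [4.0, -4.0, 0.5, -0.5]
denoised = shrink_channel(features, alpha=0.5)
# denoised == [2.875, -2.875, 0.0, 0.0]
```

Because the threshold scales with the channel's own magnitude, weak activations (treated as noise) are removed while strong ones are kept with a small bias, which is the sense in which such a network performs feature denoising.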