Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (7): 70-79. doi: 10.12141/j.issn.1000-565X.240508

• Electronics, Communication and Automatic Control •

Acoustic Scene Classification Method Based on Reducing High-Frequency Reverberation and RF-DRSN-EMA

CAO Yi, WANG Yanwen, LI Jie, ZHENG Zhi, SUN Hao   

  1. School of Mechanical Engineering / Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, Jiangnan University, Wuxi 214122, Jiangsu, China
  • Received: 2024-10-14 Published: 2025-07-25 Online: 2025-01-17
  • About author: CAO Yi (born 1974), male, Ph.D., professor, mainly engaged in research on speech recognition technology. E-mail: caoyi@jiangnan.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (52175234); the Programme of Introducing Talents of Discipline to Universities (B18027)

Abstract:

To address the low classification accuracy and weak generalization of existing acoustic scene classification methods, this paper proposes a classification method that combines high-frequency reverberation reduction with RF-DRSN-EMA, a frequency-domain residual shrinkage network with multi-scale attention. First, following the principle of sound dereverberation, a high-frequency reverberation reduction method is introduced. It attenuates reverberation only in the separated high-frequency band of the audio and preserves the key frequency information in the remaining bands, so that speech intelligibility improves while speech distortion is kept as small as possible. Second, RF-DRSN-EMA is built on the deep residual shrinkage network and integrates an improved frequency-domain self-calibration algorithm and a multi-scale attention module. The network uses an RF self-calibration module, whose internal long- and short-range residual structure mitigates feature collapse, to extract frequency-domain information efficiently. A multi-scale attention module at the output of each unit then focuses on the useful information there, further strengthening the model’s representation capacity. Finally, the method is evaluated on three datasets: ESC-10, UrbanSound8K, and DCASE2020 Task 1A. The results show that the high-frequency reverberation reduction method selectively suppresses background noise such as high-frequency reverberation and removes redundant features with little loss of audio quality, which leads to better classification performance. RF-DRSN-EMA denoises typical frequency-domain features inside the network and extracts information efficiently, reaching best classification accuracies of 98.00%, 93.42%, and 72.80% on the three datasets, respectively. These results confirm the effectiveness and generalizability of the proposed method.
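
The abstract does not give the network's equations. As a rough illustration of the shrinkage step that deep residual shrinkage networks are generally built around, the sketch below applies soft thresholding to a single feature channel. The function names, the `alpha` parameter, and the rule tau = alpha * mean(|x|) are assumptions chosen for illustration and are not taken from the paper; in a trained network the scaling factor would be produced by a small learned sub-network rather than set by hand.

```python
from __future__ import annotations

from statistics import fmean


def soft_threshold(value: float, tau: float) -> float:
    """Shrink value toward zero by tau; anything with |value| <= tau becomes 0."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


def shrink_channel(channel: list[float], alpha: float) -> list[float]:
    """Soft-threshold one feature channel with tau = alpha * mean(|x|).

    alpha is assumed to lie in (0, 1). Here it is passed in by hand
    (hypothetical parameter); a real network would learn it.
    """
    tau = alpha * fmean([abs(v) for v in channel])
    return [soft_threshold(v, tau) for v in channel]
```

Calling `shrink_channel([4.0, -0.5, 0.2, -3.0, 0.3], alpha=0.5)` sets the three small-magnitude entries to zero and pulls the two large ones toward zero by the same threshold, which is the sense in which shrinkage acts as feature denoising.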

Key words: acoustic scene classification, reducing high-frequency reverberation, frequency-domain residual shrinkage network, multi-scale attention, speech enhancement
