Journal of South China University of Technology (Natural Science Edition) ›› 2026, Vol. 54 ›› Issue (1): 70-82. DOI: 10.12141/j.issn.1000-565X.250054
Received: 2025-03-03
Online: 2026-01-10
Published: 2025-07-18
About the author: YANG Junmei (b. 1979), female, Ph.D., associate professor. Her research interests include intelligent signal processing, adaptive filtering, image super-resolution reconstruction, and speech dereverberation. E-mail: yjunmei@scut.edu.cn
YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu
Abstract:
Single-channel speech separation aims to recover the clean speech of target speakers from a mixture captured by a single microphone, and has important applications in smart homes, conferencing systems, and hearing-aid devices. With the rapid development of deep learning, single-channel speech separation based on self-attention networks has made remarkable progress. Although self-attention networks excel at capturing long-range contextual information, they remain limited in capturing fine-grained characteristics of real speech such as temporal/spectral continuity, spectral structure, and timbre. Moreover, existing separation architectures built on a single attention paradigm struggle to fuse multi-scale features effectively. This paper proposes a Time-domain Comprehensive Attention Network (TCANet) that addresses these problems through the coordinated design of local and global attention modules. For local modeling, TCANet adopts a Conformer structure enhanced with S&C-SENet to finely extract short-term features such as spectral structure and timbre; for global modeling, it builds an improved Transformer module with relative position embeddings to explicitly learn long-range dependencies in speech. In addition, TCANet fuses local intra-chunk features with global inter-chunk correlations across scales via a dimension-transform mechanism. Experimental results on the benchmark datasets LRS2-2Mix, Libri2Mix, and EchoSet show that the proposed method outperforms existing end-to-end speech separation methods on the scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi) metrics.
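The abstract's local branch couples a Conformer with concurrent spatial and channel "squeeze-and-excitation" (scSE) gating. The paper's exact S&C-SENet configuration is not given here; as a rough illustration of the gating idea only, the sketch below uses a single-layer channel excitation (the original scSE block uses a two-layer bottleneck) and combines the two branches by element-wise max (addition is another common choice). All names (`scse`, `w_ch`, `w_sp`) are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scse(x, w_ch, w_sp):
    """Concurrent spatial & channel squeeze-and-excitation gating (simplified).

    x    : (C, T) feature map (channels x time frames)
    w_ch : (C, C) weights of the channel-excitation projection
           (a single layer here; scSE proper uses a two-layer bottleneck)
    w_sp : (C,)   weights of the 1x1 "spatial" squeeze over channels
    """
    # channel branch: squeeze time by global averaging, then gate each channel
    z = x.mean(axis=1)               # (C,)
    ch_gate = sigmoid(w_ch @ z)      # (C,) per-channel gates in (0, 1)
    cse = x * ch_gate[:, None]

    # spatial branch: squeeze channels with a 1x1 projection, gate each frame
    sp_gate = sigmoid(w_sp @ x)      # (T,) per-frame gates in (0, 1)
    sse = x * sp_gate[None, :]

    # combine the two recalibrated maps
    return np.maximum(cse, sse)
```

Because both gates lie in (0, 1), the block can only attenuate features, re-weighting "where" (frames) and "what" (channels) is informative before the Conformer's attention and convolution layers.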
YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu. A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism[J]. Journal of South China University of Technology (Natural Science Edition), 2026, 54(1): 70-82.
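The cross-scale "dimension-transform" fusion described in the abstract follows the dual-path convention (cf. DPRNN/DPTNet): a long time-domain feature sequence is folded into chunks so that local attention runs along the intra-chunk axis and global attention along the inter-chunk axis, and moving between the two views is just a transpose. A minimal numpy sketch under that assumption, with hypothetical helper names:

```python
import numpy as np

def segment(x, k):
    """Fold a (C, T) sequence into non-overlapping chunks: (C, K, S).

    K is the intra-chunk (local) axis, S the inter-chunk (global) axis.
    Switching a dual-path block between local and global processing is a
    transpose of the last two dimensions of this tensor.
    """
    c, t = x.shape
    pad = (-t) % k                          # zero-pad T up to a multiple of K
    x = np.pad(x, ((0, 0), (0, pad)))
    return x.reshape(c, -1, k).transpose(0, 2, 1)   # (C, K, S)

def restore(chunks, t):
    """Invert segment(): (C, K, S) -> (C, T), dropping the padding."""
    c, k, s = chunks.shape
    return chunks.transpose(0, 2, 1).reshape(c, s * k)[:, :t]
```

Real dual-path models usually use 50%-overlapped chunks with overlap-add on the way back; the non-overlapping version above keeps the reshape/transpose mechanics easy to follow.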
Table 1  Comparison of experimental results of ten models
| Model | SI-SNRi/dB (Libri2Mix) | SI-SNRi/dB (LRS2-2Mix) | SI-SNRi/dB (EchoSet) | SDRi/dB (Libri2Mix) | SDRi/dB (LRS2-2Mix) | SDRi/dB (EchoSet) | Computational cost/GFLOPS | Params/10⁶ |
|---|---|---|---|---|---|---|---|---|
| BLSTM-TasNet | 7.9 | 6.1 | 5.2 | 8.7 | 6.8 | 4.3 | 23.1 | 23.6 |
| Conv-TasNet | 12.2 | 10.6 | 7.7 | 12.7 | 11.0 | 6.9 | 6.3 | 5.6 |
| SuDoRM-RF1.0x | 13.5 | 11.0 | 7.7 | 14.0 | 11.4 | 6.8 | 2.5 | 2.7 |
| SuDoRM-RF2.5x | 14.0 | 11.3 | 8.1 | 14.4 | 11.7 | 7.0 | 19.8 | 6.4 |
| DPRNN | 16.1 | 12.7 | 5.9 | 16.6 | 13.0 | 5.1 | 125.3 | 2.7 |
| DPTNet | 16.7 | 13.3 | 8.9 | 17.1 | 13.6 | 8.1 | 171.1 | 2.7 |
| A-FRCNN-16 | 16.7 | 13.0 | 9.6 | 17.2 | 13.3 | 8.8 | 22.8 | 6.1 |
| TDANet | 16.9 | 13.2 | 10.1 | 17.4 | 13.5 | 9.2 | 9.3 | 2.3 |
| SepFormer | 16.5 | 13.5 | 9.7 | 17.0 | 13.8 | 8.7 | 145.6 | 26.0 |
| TCANet | 17.0 | 14.1 | 9.9 | 17.3 | 14.5 | 9.1 | 108.3 | 20.0 |
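The SI-SNRi columns in Table 1 report how much the scale-invariant SNR of the estimate improves over that of the unprocessed mixture, both measured against the clean reference. A minimal numpy sketch of the standard definition (function names are illustrative):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a clean reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # optimal-scaling projection of the estimate onto the reference
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

def si_snri(est, ref, mix):
    """SI-SNR improvement: separation gain over the mixture itself."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

The projection step makes the metric insensitive to overall gain, so a model is not rewarded or penalized for merely rescaling its output; SDRi is defined analogously from the (non-scale-invariant) SDR.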