Electronics, Communication and Automatic Control

A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

  • YANG Junmei,
  • ZHANG Bangcheng,
  • YANG Lu,
  • ZENG Delu
  • School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China

YANG Junmei (1979—), female, Ph.D., associate professor; her research interests include intelligent signal processing, adaptive filtering, image super-resolution reconstruction, and speech dereverberation. E-mail: yjunmei@scut.edu.cn

Received date: 2025-03-03

Online published: 2025-07-18

Supported by

the Natural Science Foundation of Guangdong Province (2023A1515011281)


Abstract

Single-channel speech separation aims to extract the clean speech of a target speaker from a mixture recorded by a single microphone, and has significant application value in scenarios such as smart homes, conference systems, and hearing aids. With the rapid development of deep learning, self-attention-based approaches to single-channel speech separation have achieved remarkable progress. While self-attention networks excel at capturing contextual information in long sequences, they still exhibit limitations in capturing detailed features of real-world speech, such as temporal/spectral continuity, spectral structure, and timbre. Moreover, existing separation architectures built on a single attention paradigm struggle to fuse multi-scale features effectively. To address these challenges, this paper proposes a Time-Domain Comprehensive Attention Network (TCANet) that tackles them through a synergistic design of local and global attention modules. Local modeling employs an S&C-SENet-enhanced Conformer structure to capture short-term features such as spectral structure and timbre in detail, while global modeling incorporates a modified Transformer module with relative position embedding to explicitly learn long-term dependencies in speech. Furthermore, TCANet achieves cross-scale fusion of intra-block local features and inter-block global correlations through a dimension transformation mechanism. Experimental results on three benchmark datasets (LRS2-2Mix, Libri2Mix, and EchoSet) show that the proposed method outperforms existing end-to-end speech separation approaches in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).

Cite this article

YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu. A single-channel speech separation model based on time-domain comprehensive attention mechanism[J]. Journal of South China University of Technology (Natural Science Edition), 2026, 54(1): 70-82. DOI: 10.12141/j.issn.1000-565X.250054

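For readers unfamiliar with the building blocks named in the abstract, a few illustrative sketches follow. They are hedged reconstructions from the abstract's description, not the authors' code. First, the "S&C-SENet" in the local branch suggests a concurrent spatial-and-channel squeeze-and-excitation (scSE) design; below is a minimal PyTorch sketch of such a block adapted to 1-D speech features, where the class name SCSE1d and the reduction ratio are illustrative assumptions.

import torch
import torch.nn as nn

class SCSE1d(nn.Module):
    """Concurrent spatial-and-channel squeeze-and-excitation for
    1-D feature maps of shape (batch, channels, time)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel branch: squeeze over time, excite each channel.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial (temporal) branch: squeeze over channels, excite each frame.
        self.spatial_gate = nn.Sequential(
            nn.Conv1d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recalibrate along both axes and fuse by element-wise addition
        # (the scSE literature also explores max-out fusion).
        return x * self.channel_gate(x) + x * self.spatial_gate(x)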
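Second, the "dimension transformation mechanism" for fusing intra-block and inter-block information reads like the standard dual-path recipe: chunk the long feature sequence, then transpose tensor axes so attention alternates between the within-chunk (local) and across-chunk (global) directions. The chunk size and 50% overlap below are assumptions for illustration; TCANet's actual parameters are not given in this abstract.

import torch
import torch.nn.functional as F

def segment(x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Split (batch, channels, time) into 50%-overlapping chunks,
    returning (batch, channels, chunk_size, num_chunks)."""
    hop = chunk_size // 2
    t = x.shape[-1]
    # Zero-pad so the padded signal is covered by a whole number of hops.
    pad = (hop - (t - chunk_size) % hop) % hop
    x = F.pad(x, (hop, hop + pad))
    # unfold extracts sliding windows of length chunk_size with stride hop.
    chunks = x.unfold(dimension=-1, size=chunk_size, step=hop)  # (b, n, S, K)
    return chunks.permute(0, 1, 3, 2).contiguous()              # (b, n, K, S)

x = torch.randn(2, 64, 16000)        # (batch, channels, time)
chunks = segment(x, chunk_size=250)  # (2, 64, 250, S)
b, n, k, s = chunks.shape
# Intra-chunk (local) pass: sequences of length K, one per chunk.
intra_view = chunks.permute(0, 3, 2, 1).reshape(b * s, k, n)
# Inter-chunk (global) pass: sequences of length S, one per in-chunk position.
inter_view = chunks.permute(0, 2, 3, 1).reshape(b * k, s, n)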
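Finally, the evaluation metrics are standard in the separation literature. SI-SNRi is the improvement in scale-invariant SNR of the separated signal over the unprocessed mixture, both measured against the clean reference. A minimal reference implementation of the underlying metric (again, not the paper's code):

import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (batch, time) signals."""
    # Zero-mean both signals so the measure ignores DC offsets.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference: the optimally scaled target.
    dot = (est * ref).sum(dim=-1, keepdim=True)
    s_target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))

# SI-SNRi: gain of the separated output over the raw mixture.
# si_snri = si_snr(separated, reference) - si_snr(mixture, reference)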
