Electronics, Communication & Automation Technology

A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

  • YANG Junmei,
  • ZHANG Bangcheng,
  • YANG Lu,
  • ZENG Delu
  • School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China

Received date: 2025-03-03

Online published: 2025-07-18

Supported by

the Natural Science Foundation of Guangdong Province (2023A1515011281)

Abstract

Single-channel speech separation aims to extract clean target-speaker speech from a mixed audio signal recorded by a single microphone, and has significant application value in scenarios such as smart homes, conference systems, and hearing aids. With the rapid development of deep learning, self-attention-based approaches to single-channel speech separation have achieved remarkable progress. However, while self-attention networks excel at capturing contextual information in long sequences, they still exhibit limitations in capturing the detailed features of real-world speech, such as temporal/spectral continuity, spectral structure, and timbre. Moreover, existing separation architectures built on a single attention paradigm struggle to achieve effective multi-scale feature fusion. To address these challenges, this paper proposes the Temporal Comprehensive Attention Network (TCANet), which tackles both issues through a synergistic design of local and global attention modules. Local modeling employs an S&C-SENet-enhanced Conformer structure to capture short-term features such as spectral structure and timbre in detail, while global modeling incorporates a modified Transformer module with relative position embedding to explicitly learn long-term dependencies in speech. Furthermore, TCANet achieves cross-scale fusion of intra-block local features and inter-block global correlations through a dimension transformation mechanism. Experimental results on three benchmark datasets (LRS2-2Mix, Libri2Mix, and EchoSet) demonstrate that the proposed method outperforms existing end-to-end speech separation approaches in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
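
The abstract describes the architecture only at a high level, so the following PyTorch sketch is illustrative rather than a reproduction of TCANet: it shows the standard dual-path pattern that the dimension transformation mechanism builds on, with plain Transformer encoder layers standing in for the paper's S&C-SENet-enhanced Conformer (local) and relative-position Transformer (global). The names segment, DualPathBlock, and si_snr, and all hyperparameters, are hypothetical placeholders introduced here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def segment(x: torch.Tensor, k: int) -> torch.Tensor:
        # Fold a feature sequence [B, N, T] into half-overlapping chunks
        # [B, N, K, S]: one axis indexes time within a chunk (K, local),
        # the other indexes chunks across the utterance (S, global).
        hop = k // 2
        b, n, t = x.shape
        pad = k - t if t < k else (hop - (t - k) % hop) % hop
        x = F.pad(x, (0, pad))
        return x.unfold(-1, k, hop).permute(0, 1, 3, 2)  # [B, N, K, S]

    class DualPathBlock(nn.Module):
        # One dual-path block: attention over the intra-chunk axis K
        # (short-term detail), then over the inter-chunk axis S
        # (long-term dependencies across the utterance).
        def __init__(self, n_feat: int, n_head: int = 4):
            super().__init__()
            self.local = nn.TransformerEncoderLayer(n_feat, n_head, batch_first=True)
            self.glob = nn.TransformerEncoderLayer(n_feat, n_head, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, N, K, S]
            b, n, k, s = x.shape
            y = x.permute(0, 3, 2, 1).reshape(b * s, k, n)   # local: seq len K
            y = self.local(y).reshape(b, s, k, n)
            # The dimension transformation: swap the sequence axis from K
            # to S, so the second attention spans chunk-level context.
            y = y.permute(0, 2, 1, 3).reshape(b * k, s, n)   # global: seq len S
            y = self.glob(y).reshape(b, k, s, n)
            return y.permute(0, 3, 1, 2)                     # back to [B, N, K, S]

    def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Scale-invariant SNR in dB over the last axis; the reported
        # SI-SNRi is si_snr(estimate, ref) - si_snr(mixture, ref).
        est = est - est.mean(-1, keepdim=True)
        ref = ref - ref.mean(-1, keepdim=True)
        proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
        noise = est - proj
        return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

For instance, segment(features, k=128) followed by a stack of DualPathBlock modules processes a [B, N, T] encoder output, and an inverse overlap-add restores the time axis before mask estimation; chunk length and block count here are arbitrary, not values from the paper.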

Cite this article

YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu. A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism[J]. Journal of South China University of Technology (Natural Science), 2026, 54(1): 70-82. DOI: 10.12141/j.issn.1000-565X.250054
