Electronics, Communication & Automation Technology

A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

  • YANG Junmei,
  • ZHANG Bangcheng,
  • YANG Lu,
  • ZENG Delu
  • School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China

Received date: 2025-03-03

Online published: 2025-07-18

Supported by

the Natural Science Foundation of Guangdong Province (2023A1515011281)

Abstract

Single-channel speech separation aims to extract clean target-speaker speech from a mixed audio signal recorded by a single microphone, and has significant application value in scenarios such as smart homes, conference systems, and hearing aids. With the rapid development of deep learning, self-attention-based approaches to single-channel speech separation have achieved remarkable progress. However, while self-attention networks excel at capturing contextual information in long sequences, they still exhibit limitations in capturing the detailed features of real-world speech, such as temporal/spectral continuity, spectral structure, and timbre. Moreover, existing separation architectures built on a single attention paradigm struggle to achieve effective multi-scale feature fusion. To address these challenges, this paper proposes the Temporal Comprehensive Attention Network (TCANet), which tackles both issues through a synergistic design of local and global attention modules. Local modeling employs an S&C-SENet-enhanced Conformer structure to capture short-term features such as spectral structure and timbre in detail, while global modeling incorporates a modified Transformer module with relative position embedding to explicitly learn long-term dependencies in speech. Furthermore, TCANet achieves cross-scale fusion of intra-block local features and inter-block global correlations through a dimension transformation mechanism. Experimental results on three benchmark datasets (LRS2-2Mix, Libri2Mix, and EchoSet) demonstrate that the proposed method outperforms existing end-to-end speech separation approaches in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
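
The abstract describes the architecture only at a high level, so the following PyTorch sketch is illustrative rather than a reproduction of TCANet: it shows the standard dual-path pattern that the dimension transformation mechanism builds on, with plain Transformer encoder layers standing in for the paper's S&C-SENet-enhanced Conformer (local) and relative-position Transformer (global). The names segment, DualPathBlock, and si_snr, and all hyperparameters, are hypothetical placeholders introduced here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def segment(x: torch.Tensor, k: int) -> torch.Tensor:
        # Fold a feature sequence [B, N, T] into half-overlapping chunks
        # [B, N, K, S]: one axis indexes time within a chunk (K, local),
        # the other indexes chunks across the utterance (S, global).
        hop = k // 2
        b, n, t = x.shape
        pad = k - t if t < k else (hop - (t - k) % hop) % hop
        x = F.pad(x, (0, pad))
        return x.unfold(-1, k, hop).permute(0, 1, 3, 2)  # [B, N, K, S]

    class DualPathBlock(nn.Module):
        # One dual-path block: attention over the intra-chunk axis K
        # (short-term detail), then over the inter-chunk axis S
        # (long-term dependencies across the utterance).
        def __init__(self, n_feat: int, n_head: int = 4):
            super().__init__()
            self.local = nn.TransformerEncoderLayer(n_feat, n_head, batch_first=True)
            self.glob = nn.TransformerEncoderLayer(n_feat, n_head, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, N, K, S]
            b, n, k, s = x.shape
            y = x.permute(0, 3, 2, 1).reshape(b * s, k, n)   # local: seq len K
            y = self.local(y).reshape(b, s, k, n)
            # The dimension transformation: swap the sequence axis from K
            # to S, so the second attention spans chunk-level context.
            y = y.permute(0, 2, 1, 3).reshape(b * k, s, n)   # global: seq len S
            y = self.glob(y).reshape(b, k, s, n)
            return y.permute(0, 3, 1, 2)                     # back to [B, N, K, S]

    def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Scale-invariant SNR in dB over the last axis; the reported
        # SI-SNRi is si_snr(estimate, ref) - si_snr(mixture, ref).
        est = est - est.mean(-1, keepdim=True)
        ref = ref - ref.mean(-1, keepdim=True)
        proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
        noise = est - proj
        return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

For instance, segment(features, k=128) followed by a stack of DualPathBlock modules processes a [B, N, T] encoder output, and an inverse overlap-add restores the time axis before mask estimation; chunk length and block count here are arbitrary, not values from the paper.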

Cite this article

YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu. A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism[J]. Journal of South China University of Technology (Natural Science), 2026, 54(1): 70-82. DOI: 10.12141/j.issn.1000-565X.250054
