Journal of South China University of Technology (Natural Science Edition) ›› 2026, Vol. 54 ›› Issue (1): 70-82. doi: 10.12141/j.issn.1000-565X.250054

• Electronics, Communication & Automation Technology •

A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2025-03-03 Online: 2026-01-10 Published: 2025-07-18
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2023A1515011281)

Abstract:

Single-channel speech separation aims to extract clean target-speaker speech from a mixed audio signal recorded by a single microphone, and has significant application value in scenarios such as smart homes, conference systems, and hearing aids. With the rapid development of deep learning, approaches to single-channel speech separation based on self-attention networks have made remarkable progress. While self-attention networks excel at capturing contextual information in long sequences, they still fall short in capturing fine-grained features such as temporal/spectral continuity, spectral structure, and timbre in real-world speech scenarios. Moreover, existing separation architectures built on a single attention paradigm struggle to achieve effective multi-scale feature fusion. To address these challenges, this paper proposes the Temporal Comprehensive Attention Network (TCANet), a synergistic design of local and global attention modules. Local modeling employs an S&C-SENet-enhanced Conformer structure to capture short-term features such as spectral structure and timbre in fine detail, while global modeling incorporates a modified Transformer module with relative position embedding to explicitly learn long-term dependencies in speech. Furthermore, TCANet fuses intra-block local features and inter-block global correlations across scales through a dimension transformation mechanism. Experimental results on three benchmark datasets (LRS2-2Mix, Libri2Mix, and EchoSet) demonstrate that the proposed method outperforms existing end-to-end speech separation approaches in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
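As a rough illustration of the architecture the abstract describes, the PyTorch sketch below pairs intra-chunk attention with a depthwise-convolution branch (local modeling), inter-chunk attention with an additive relative position bias (global modeling), and a reshape between the two (the dimension transformation). This is a minimal sketch under assumed names and hyperparameters (LocalConformerBlock, GlobalTransformerBlock, TCABlock, chunk_len), not the authors' implementation; in particular, the S&C-SENet enhancement is omitted and the relative position embedding is simplified to a learned per-offset bias.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalConformerBlock(nn.Module):
    """Self-attention plus a depthwise-conv branch for short-term detail.
    Hypothetical stand-in for the paper's S&C-SENet-enhanced Conformer."""

    def __init__(self, dim: int, heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Depthwise convolution models local structure such as timbre and spectral shape.
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x):                                  # x: (batch, time, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual attention
        h = self.norm2(x).transpose(1, 2)                  # (batch, dim, time) for Conv1d
        return x + self.conv(h).transpose(1, 2)            # residual conv branch


class GlobalTransformerBlock(nn.Module):
    """Multi-head attention with a learned per-offset relative position bias."""

    def __init__(self, dim: int, heads: int = 4, max_len: int = 512):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # One bias per relative offset in [-(max_len-1), max_len-1], shared across heads.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, x):                                  # x: (batch, time, dim), time <= max_len
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, -1).transpose(1, 2) for z in (q, k, v))
        idx = torch.arange(t, device=x.device)
        rel = idx[None, :] - idx[:, None] + t - 1          # map offsets to [0, 2t-2]
        scores = q @ k.transpose(-2, -1) * self.scale + self.rel_bias[rel]
        y = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, d)
        return x + self.out(y)


class TCABlock(nn.Module):
    """Local attention within chunks, then global attention across chunks,
    linked by a reshape (the 'dimension transformation')."""

    def __init__(self, dim: int, chunk_len: int = 50):
        super().__init__()
        self.chunk_len = chunk_len
        self.intra = LocalConformerBlock(dim)
        self.inter = GlobalTransformerBlock(dim)

    def forward(self, x):                                  # x: (batch, time, dim)
        b, t, d = x.shape
        x = F.pad(x, (0, 0, 0, (-t) % self.chunk_len))     # pad time to a chunk multiple
        n = x.shape[1] // self.chunk_len
        x = self.intra(x.reshape(b * n, self.chunk_len, d))    # intra-chunk (local)
        x = x.reshape(b, n, self.chunk_len, d).transpose(1, 2) # dimension transformation
        x = self.inter(x.reshape(b * self.chunk_len, n, d))    # inter-chunk (global)
        x = x.reshape(b, self.chunk_len, n, d).transpose(1, 2)
        return x.reshape(b, n * self.chunk_len, d)[:, :t]      # trim padding


# Shape check on dummy encoder features: (batch, frames, channels).
feats = torch.randn(2, 200, 64)
assert TCABlock(dim=64)(feats).shape == feats.shape
```

The reported metrics are standard. A minimal SI-SNR implementation might look as follows; SI-SNRi is then the separated signal's score minus the unprocessed mixture's score against the same clean reference.

```python
def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (batch, time) tensors."""
    est = est - est.mean(dim=-1, keepdim=True)             # remove DC offset
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference so the metric ignores scaling.
    target = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

# SI-SNRi as reported in such papers: si_snr(separated, clean) - si_snr(mixture, clean)
```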

Key words: deep learning, speech separation, Transformer module, Conformer structure, comprehensive attention
