Journal of South China University of Technology (Natural Science Edition) ›› 2026, Vol. 54 ›› Issue (1): 70-82. doi: 10.12141/j.issn.1000-565X.250054

• Electronics, Communication & Automation Technology •

A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2025-03-03 Online: 2026-01-10 Published: 2025-07-18
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2023A1515011281)

Abstract:

Single-channel speech separation aims to extract clean target-speaker speech from a mixed audio signal recorded by a single microphone, and has significant application value in scenarios such as smart homes, conference systems, and hearing aids. With the rapid development of deep learning, approaches to single-channel speech separation based on self-attention networks have made remarkable progress. While self-attention networks excel at capturing contextual information in long sequences, they still fall short in capturing fine-grained features such as temporal/spectral continuity, spectral structure, and timbre in real-world speech scenarios. Moreover, existing separation architectures built on a single attention paradigm struggle to achieve effective multi-scale feature fusion. To address these challenges, this paper proposes the Temporal Comprehensive Attention Network (TCANet), a synergistic design of local and global attention modules. Local modeling employs an S&C-SENet-enhanced Conformer structure to capture short-term features such as spectral structure and timbre in fine detail, while global modeling incorporates a modified Transformer module with relative position embedding to explicitly learn long-term dependencies in speech. Furthermore, TCANet fuses intra-block local features and inter-block global correlations across scales through a dimension transformation mechanism. Experimental results on three benchmark datasets (LRS2-2Mix, Libri2Mix, and EchoSet) demonstrate that the proposed method outperforms existing end-to-end speech separation approaches in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
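As a rough illustration of the architecture the abstract describes, the PyTorch sketch below pairs intra-chunk attention with a depthwise-convolution branch (local modeling), inter-chunk attention with an additive relative position bias (global modeling), and a reshape between the two (the dimension transformation). This is a minimal sketch under assumed names and hyperparameters (LocalConformerBlock, GlobalTransformerBlock, TCABlock, chunk_len), not the authors' implementation; in particular, the S&C-SENet enhancement is omitted and the relative position embedding is simplified to a learned per-offset bias.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalConformerBlock(nn.Module):
    """Self-attention plus a depthwise-conv branch for short-term detail.
    Hypothetical stand-in for the paper's S&C-SENet-enhanced Conformer."""

    def __init__(self, dim: int, heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Depthwise convolution models local structure such as timbre and spectral shape.
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x):                                  # x: (batch, time, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual attention
        h = self.norm2(x).transpose(1, 2)                  # (batch, dim, time) for Conv1d
        return x + self.conv(h).transpose(1, 2)            # residual conv branch


class GlobalTransformerBlock(nn.Module):
    """Multi-head attention with a learned per-offset relative position bias."""

    def __init__(self, dim: int, heads: int = 4, max_len: int = 512):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # One bias per relative offset in [-(max_len-1), max_len-1], shared across heads.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, x):                                  # x: (batch, time, dim), time <= max_len
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, -1).transpose(1, 2) for z in (q, k, v))
        idx = torch.arange(t, device=x.device)
        rel = idx[None, :] - idx[:, None] + t - 1          # map offsets to [0, 2t-2]
        scores = q @ k.transpose(-2, -1) * self.scale + self.rel_bias[rel]
        y = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, d)
        return x + self.out(y)


class TCABlock(nn.Module):
    """Local attention within chunks, then global attention across chunks,
    linked by a reshape (the 'dimension transformation')."""

    def __init__(self, dim: int, chunk_len: int = 50):
        super().__init__()
        self.chunk_len = chunk_len
        self.intra = LocalConformerBlock(dim)
        self.inter = GlobalTransformerBlock(dim)

    def forward(self, x):                                  # x: (batch, time, dim)
        b, t, d = x.shape
        x = F.pad(x, (0, 0, 0, (-t) % self.chunk_len))     # pad time to a chunk multiple
        n = x.shape[1] // self.chunk_len
        x = self.intra(x.reshape(b * n, self.chunk_len, d))    # intra-chunk (local)
        x = x.reshape(b, n, self.chunk_len, d).transpose(1, 2) # dimension transformation
        x = self.inter(x.reshape(b * self.chunk_len, n, d))    # inter-chunk (global)
        x = x.reshape(b, self.chunk_len, n, d).transpose(1, 2)
        return x.reshape(b, n * self.chunk_len, d)[:, :t]      # trim padding


# Shape check on dummy encoder features: (batch, frames, channels).
feats = torch.randn(2, 200, 64)
assert TCABlock(dim=64)(feats).shape == feats.shape
```

The reported metrics are standard. A minimal SI-SNR implementation might look as follows; SI-SNRi is then the separated signal's score minus the unprocessed mixture's score against the same clean reference.

```python
def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (batch, time) tensors."""
    est = est - est.mean(dim=-1, keepdim=True)             # remove DC offset
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference so the metric ignores scaling.
    target = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

# SI-SNRi as reported in such papers: si_snr(separated, clean) - si_snr(mixture, clean)
```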

Key words: deep learning, speech separation, Transformer module, Conformer structure, comprehensive attention
