Journal of South China University of Technology (Natural Science Edition) ›› 2026, Vol. 54 ›› Issue (1): 70-82. doi: 10.12141/j.issn.1000-565X.250054

• Electronics, Communication & Automation Technology •

A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China

• Online: 2026-01-25  Published: 2025-07-18

Abstract:

Recent advances in monaural speech separation leveraging self-attention mechanisms have yielded substantial gains in separation quality. Despite their strength in modeling long-range contextual dependencies, self-attention architectures struggle to preserve subtle acoustic characteristics, including time-frequency continuity, spectral structure, and timbre. Moreover, existing single-paradigm attention frameworks lack effective mechanisms for multi-scale feature integration. To address these challenges, this paper proposes TCANet, an end-to-end time-domain comprehensive attention network that combines local and global attention modules for monaural speech separation. Local modeling employs S&C-SENet-enhanced Conformer blocks to capture short-term spectral structure, timbral features, and other fine-grained acoustic details. Global modeling incorporates improved Transformer blocks with relative position embeddings to explicitly learn long-range dependencies within dynamic speech contexts. In addition, a dimension transformation mechanism bridges intra-block local features with inter-block global representations, enabling cross-scale feature co-optimization. Extensive experiments on benchmark datasets (LRS2-2Mix, Libri2Mix, and EchoSet) show that the proposed method outperforms other end-to-end speech separation methods in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
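To make the dual-scale structure concrete, the following is a minimal, hypothetical PyTorch sketch of the idea described in the abstract, not the authors' implementation: local attention applied within short chunks, a dimension transformation (a reshape) that turns chunk positions into a sequence, and global attention applied across chunks. All module choices and hyperparameters here are illustrative assumptions; a depthwise convolution plus standard multi-head attention stands in for the paper's S&C-SENet-enhanced Conformer block, and vanilla multi-head attention stands in for its improved Transformer block with relative position embeddings.

```python
# Illustrative sketch only (not the paper's code): local intra-chunk
# modeling, a dimension transformation, then global inter-chunk modeling.
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, chunk=100):
        super().__init__()
        self.chunk = chunk
        # Local path: depthwise conv + self-attention as a stand-in for the
        # S&C-SENet-enhanced Conformer block of the paper.
        self.local_conv = nn.Conv1d(d_model, d_model, 3, padding=1, groups=d_model)
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_norm = nn.LayerNorm(d_model)
        # Global path: plain self-attention as a stand-in for the improved
        # Transformer block with relative position embeddings.
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.global_norm = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (B, T, D), T divisible by chunk
        B, T, D = x.shape
        K = self.chunk
        S = T // K                              # number of chunks
        # --- intra-chunk (local) modeling over each length-K chunk ---
        xc = x.reshape(B * S, K, D)
        y = xc + self.local_conv(xc.transpose(1, 2)).transpose(1, 2)
        a, _ = self.local_attn(y, y, y)
        y = self.local_norm(y + a)
        # --- dimension transformation: chunks become the sequence axis ---
        z = y.reshape(B, S, K, D).transpose(1, 2).reshape(B * K, S, D)
        # --- inter-chunk (global) modeling across the S chunks ---
        a, _ = self.global_attn(z, z, z)
        z = self.global_norm(z + a)
        return z.reshape(B, K, S, D).transpose(1, 2).reshape(B, T, D)

x = torch.randn(2, 400, 64)                     # batch of encoded mixtures
print(LocalGlobalBlock()(x).shape)              # torch.Size([2, 400, 64])
```

In this sketch, the reshape between the two attention stages plays the role the abstract assigns to the dimension transformation mechanism: it exposes one token per chunk to the inter-chunk stage, so local and global features are refined over the same representation.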

Key words: deep learning, speech separation, Transformer, Conformer, comprehensive attention