Journal of South China University of Technology (Natural Science Edition) ›› 2026, Vol. 54 ›› Issue (1): 70-82. doi: 10.12141/j.issn.1000-565X.250054

• Electronics, Communication and Automatic Control •


A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

YANG Junmei  ZHANG Bangcheng  YANG Lu  ZENG Delu   

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China

  • Online: 2026-01-25  Published: 2025-07-18


Abstract:

Recent advances in monaural speech separation leveraging self-attention mechanisms have demonstrated substantial improvements. Despite their superior capabilities in modeling long-range contextual dependencies, self-attention architectures exhibit limitations in preserving subtle acoustic characteristics, including temporal or frequency continuity, spectral structure, and timbre. Furthermore, existing single-paradigm attention frameworks lack effective mechanisms for multi-scale feature integration. To address these challenges, this paper proposes TCANet, an end-to-end time-domain comprehensive attention network incorporating local and global attention modules for monaural speech separation. Local modeling employs S&C-SENet-enhanced Conformer blocks to meticulously capture short-term spectral structures, timbral features, and other fine-grained acoustic details. Global modeling incorporates improved Transformer blocks with relative position embeddings to explicitly learn long-range dependencies within dynamic speech contexts. Furthermore, a dimension transformation mechanism bridges intra-block local features with inter-block global representations, thereby achieving cross-scale feature co-optimization. Extensive experimental results on benchmark datasets (LRS2-2Mix, Libri2Mix and EchoSet) show that the proposed method outperforms other end-to-end speech separation methods in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
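The cross-scale scheme described above — local modeling within chunks, then global modeling across chunks via a dimension transformation — follows the dual-path pattern common in time-domain separation networks. A minimal NumPy sketch of the chunking and axis handling is given below; `local_fn` and `global_fn` are placeholders standing in for the paper's S&C-SENet-enhanced Conformer and improved Transformer blocks, and all function names here are illustrative assumptions, not the authors' code:

```python
import numpy as np

def chunk_signal(feats, chunk_len):
    """Split a (time, dim) feature map into (num_chunks, chunk_len, dim),
    zero-padding the tail so time divides evenly into chunks."""
    T, D = feats.shape
    pad = (-T) % chunk_len
    feats = np.pad(feats, ((0, pad), (0, 0)))
    return feats.reshape(-1, chunk_len, D)

def apply_local_then_global(chunks, local_fn, global_fn):
    """One local/global round: local_fn sees each chunk (short-term detail),
    global_fn sees each across-chunk slice (long-range dependencies)."""
    # Local modeling: process each chunk independently along within-chunk time
    out = np.stack([local_fn(c) for c in chunks])
    # Dimension transformation: swap the chunk and within-chunk axes so the
    # global block attends across chunks at each within-chunk position
    out = out.transpose(1, 0, 2)
    out = np.stack([global_fn(s) for s in out])
    # Transpose back to (num_chunks, chunk_len, dim) for the next round
    return out.transpose(1, 0, 2)
```

Stacking several such rounds lets short-term spectral/timbral detail (local) and dynamic long-range context (global) refine each other, which is the cross-scale co-optimization the abstract refers to.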

Key words: deep learning, speech separation, Transformer, Conformer, comprehensive attention
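The reported metrics, SI-SNRi and SDRi, measure how much a separated estimate improves over the unprocessed mixture. The following self-contained sketch implements SI-SNR and its improvement using the standard scale-invariant definition (projecting the estimate onto the reference); it is a generic reference implementation, not code from the paper:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    # Zero-mean both signals, as is standard for SI-SNR
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: the scaled target component
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

def si_snri(est, ref, mix):
    """SI-SNR improvement: gain over using the raw mixture as the estimate."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

Because of the projection step, rescaling the estimate does not change SI-SNR, so the metric rewards separation quality rather than output gain; SDRi is defined analogously from the (non-scale-invariant) signal-to-distortion ratio.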