Journal of South China University of Technology (Natural Science Edition) ›› 2026, Vol. 54 ›› Issue (1): 70-82. DOI: 10.12141/j.issn.1000-565X.250054

• Electronics, Communication and Automatic Control •


A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2025-03-03 Online: 2026-01-10 Published: 2025-07-18
  • About the author: YANG Junmei (1979—), female, Ph.D., associate professor; her research interests include intelligent signal processing, adaptive filtering, image super-resolution reconstruction, and speech dereverberation. E-mail: yjunmei@scut.edu.cn
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2023A1515011281)


Abstract:

Single-channel speech separation aims to extract the clean speech of a target speaker from a mixed signal recorded by a single microphone, and it has significant application value in scenarios such as smart homes, conference systems, and hearing aids. With the rapid development of deep learning, self-attention-based approaches to single-channel speech separation have achieved remarkable progress. While self-attention networks excel at capturing contextual information in long sequences, they remain limited in capturing detailed features such as temporal/spectral continuity, spectral structure, and timbre in real-world speech. Moreover, existing separation architectures built on a single attention paradigm struggle to fuse multi-scale features effectively. To address these challenges, this paper proposes a Time-Domain Comprehensive Attention Network (TCANet) that resolves the above issues through a synergistic design of local and global attention modules. Local modeling employs an S&C-SENet-enhanced Conformer structure to extract short-term features such as spectral structure and timbre in fine detail, while global modeling builds a modified Transformer module with relative position embeddings to explicitly learn long-term dependencies in speech. In addition, TCANet achieves cross-scale fusion of intra-block local features and inter-block global correlations through a dimension transformation mechanism. Experimental results on the benchmark datasets LRS2-2Mix, Libri2Mix, and EchoSet demonstrate that the proposed method outperforms existing end-to-end speech separation approaches in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
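The cross-scale fusion described above follows the general dual-path pattern used by many time-domain separators: the encoded sequence is split into chunks, local attention runs within each chunk, and a dimension transposition lets global attention run across chunks. The PyTorch sketch below illustrates only that dimension-transformation idea under stated assumptions; the class name, chunk length, and the plain TransformerEncoderLayer stand-ins (in place of the paper's S&C-SENet-enhanced Conformer and relative-position Transformer) are illustrative, not the actual TCANet implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathSketch(nn.Module):
    # Illustrative dual-path block: local (intra-chunk) and global (inter-chunk)
    # attention alternate via a dimension transposition. Plain Transformer encoder
    # layers stand in for TCANet's local and global modules; sizes are assumptions.
    def __init__(self, dim=64, heads=4, chunk_len=100):
        super().__init__()
        self.chunk_len = chunk_len
        self.local_attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.global_attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):                       # x: [batch, time, dim]
        b, t, d = x.shape
        k = self.chunk_len
        pad = (k - t % k) % k
        x = F.pad(x, (0, 0, 0, pad))            # pad the time axis to a multiple of k
        s = x.shape[1] // k                     # number of chunks
        x = x.reshape(b, s, k, d)               # [batch, chunks, chunk_len, dim]
        # Local modeling: attention within each chunk (short-term structure, timbre).
        x = self.local_attn(x.reshape(b * s, k, d)).reshape(b, s, k, d)
        # Dimension transformation: swap axes so attention now runs across chunks.
        x = x.transpose(1, 2)                   # [batch, chunk_len, chunks, dim]
        # Global modeling: attention across chunks (long-term dependencies).
        x = self.global_attn(x.reshape(b * k, s, d)).reshape(b, k, s, d)
        x = x.transpose(1, 2).reshape(b, s * k, d)
        return x[:, :t]                         # drop padding, restore [batch, time, dim]

For example, DualPathSketch(dim=64)(torch.randn(2, 16000, 64)) returns a tensor of the same shape; stacking several such blocks alternates fine-grained intra-chunk modeling with long-range inter-chunk modeling over the whole utterance.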

Key words: deep learning, speech separation, Transformer module, Conformer structure, comprehensive attention
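For reference, SI-SNRi measures how much the scale-invariant SNR of the separated signal improves over that of the unprocessed mixture; SDRi is the analogous improvement in signal-to-distortion ratio, usually computed with the BSS-Eval toolkit. A minimal NumPy sketch of the standard SI-SNRi definition is given below; the function names are illustrative and not taken from the paper.

import numpy as np

def si_snr(estimate, target, eps=1e-8):
    # Scale-invariant SNR (dB): project the estimate onto the target, then
    # compare the target-aligned component with the residual error.
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    s_target = np.dot(estimate, target) * target / (np.dot(target, target) + eps)
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

def si_snri(estimate, target, mixture):
    # Improvement over the unprocessed mixture, i.e. the gain due to separation.
    return si_snr(estimate, target) - si_snr(mixture, target)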

CLC number: