Journal of South China University of Technology (Natural Science Edition), 2021, Vol. 49, Issue (11): 106-115, 134. doi: 10.12141/j.issn.1000-565X.200593

Special topic: Electronics, Communication and Automatic Control (2021)

• Electronics, Communication and Automatic Control •

Music Source Separation Method Based on Unet Combining SE and BiSRU

ZHANG Ruifeng, BAI Jintong, GUAN Xin, LI Qiang

  1. School of Microelectronics, Tianjin University, Tianjin 300072, China
  • Received: 2020-09-30 Revised: 2020-12-29 Online: 2021-11-25 Published: 2021-11-01
  • Contact: GUAN Xin (born 1977), female, Ph.D., associate professor, mainly engaged in research on sound and music computing. E-mail: guanxin@tju.edu.cn
  • About author: ZHANG Ruifeng (born 1974), male, Ph.D., associate professor, mainly engaged in research on machine vision and audio processing. E-mail: zhangruifeng@tju.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (61471263) and the Natural Science Foundation of Tianjin (16JCZDJC31100)

Abstract: Music source separation is an important research topic in music information retrieval. Traditional separation methods depend on prior assumptions, have limited model complexity, and offer insufficient representation ability. Time-domain end-to-end deep learning models can address these problems, but they take a long time to train and their separation performance still leaves room for improvement. To improve the representation ability and computational efficiency of time-domain end-to-end separation, this study proposes an end-to-end network, Unet-SE-BiSRU, built on Demucs, currently the best-performing time-domain separation model. An attention mechanism is introduced into the generalized encoder and decoder layers: squeeze-and-excitation (SE) blocks extract features selectively according to the type of audio to be separated. Group normalization is added after the one-dimensional convolutions to counter the exploding or vanishing gradients that may arise during training. The bidirectional long short-term memory network is replaced with a bidirectional simple recurrent unit (BiSRU), which increases the parallelism of learning and reduces the number of model parameters. Experimental results show that the improved model raises the signal-to-noise ratio by 0.34 dB, the best separation performance among time-domain end-to-end methods to the best of the authors' knowledge, and that training time is reduced by 3/5, to 2/5 of that of the original model.

Key words: music source separation, Unet, time-domain end-to-end separation model, simple recurrent unit, squeeze-and-excitation, group normalization
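
To make two of the building blocks named in the abstract concrete, the following is a minimal pure-Python sketch of a squeeze-and-excitation gate and of group normalization over a one-dimensional feature map. It is not the authors' implementation. The `[channels][time]` layout, the function names `se_block` and `group_norm`, the weight arguments `w1`, `b1`, `w2`, `b2`, the bottleneck size, the number of groups, and the order in which the two operations are applied in the demo are all assumptions made for illustration; the abstract does not specify them. The BiSRU layer is not sketched here.

```python
import math
import random


def se_block(features, w1, b1, w2, b2):
    """Squeeze-and-excitation over a feature map laid out as [channels][time].

    w1/b1 and w2/b2 are hypothetical weights of the two fully connected layers.
    """
    # Squeeze: global average pooling over time, one descriptor per channel.
    squeezed = [sum(channel) / len(channel) for channel in features]

    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)) + b)
        for row, b in zip(w1, b1)
    ]

    # Excitation, step 2: expand back to one gate per channel, then sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
        for row, b in zip(w2, b2)
    ]

    # Scale: reweight every channel by its gate in (0, 1).
    scaled = [[g * x for x in channel] for g, channel in zip(gates, features)]
    return scaled, gates


def group_norm(features, num_groups, eps=1e-5):
    """Group normalization of [channels][time] features, without learned affine."""
    channels = len(features)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = features[g * size:(g + 1) * size]
        # Statistics are shared by all channels and time steps in the group.
        values = [x for channel in group for x in channel]
        mean = sum(values) / len(values)
        var = sum((x - mean) ** 2 for x in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(x - mean) * scale for x in channel] for channel in group])
    return out


if __name__ == "__main__":
    # Toy sizes (assumed): 4 channels, 6 time steps, bottleneck of 2, 2 groups.
    rng = random.Random(0)
    features = [[rng.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(4)]
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
    normalized = group_norm(features, num_groups=2)
    reweighted, gates = se_block(normalized, w1, [0.0, 0.0], w2, [0.0] * 4)
    print([round(g, 3) for g in gates])
```

Because the gates depend on the time-averaged content of the input, the same layer can emphasize different channels for different inputs, which is one way to read the abstract's statement that features are extracted selectively according to the type of audio being separated.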
