Electronics, Communication and Automatic Control

Music Source Separation Method Based on Unet Combining SE and BiSRU

  • School of Microelectronics, Tianjin University, Tianjin 300072, China
Zhang Ruifeng (born 1974), male, Ph.D., associate professor, mainly engaged in research on machine vision and audio processing. E-mail: zhangruifeng@tju.edu.cn

Received date: 2020-09-30

  Revised date: 2020-12-29

  Online published: 2021-01-11

Funding

Supported by the National Natural Science Foundation of China (61471263) and the Natural Science Foundation of Tianjin (16JCZDJC31100)

Cite this article

Zhang Ruifeng, Bai Jintong, Guan Xin, et al. Music source separation method based on Unet combining SE and BiSRU[J]. Journal of South China University of Technology (Natural Science Edition), 2021, 49(11): 106-115, 134. DOI: 10.12141/j.issn.1000-565X.200593

Abstract

Music source separation is an important research topic in the field of music information retrieval. Traditional music source separation methods depend on prior assumptions, have limited model complexity, and offer insufficient representation ability. Time-domain end-to-end deep learning models can address these problems, but they take a long time to train and their separation performance still leaves room for improvement. To further improve the representation ability and computational efficiency of time-domain end-to-end separation models, this study proposes an end-to-end network, Unet-SE-BiSRU, built on Demucs, currently the best-performing time-domain separation model. An attention mechanism is introduced into the generalized encoder and decoder layers: squeeze-and-excitation (SE) blocks extract features selectively according to the type of audio to be separated. Group normalization is added after the one-dimensional convolution to counter the gradient explosion or vanishing that may occur during training. The bidirectional long short-term memory network is replaced with a bidirectional simple recurrent unit (BiSRU), which further increases the parallelism of learning and reduces the number of model parameters. Experimental results show that the improved model raises the signal-to-noise ratio by 0.34 dB, the best separation performance among the time-domain end-to-end methods in the literature retrieved so far, and that the training time is reduced to 2/5 of that of the original model.
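The squeeze-and-excitation step named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used SE formulation (average each channel over time, pass the result through a two-layer bottleneck, and rescale the channels with sigmoid gates) and omits bias terms for brevity. The function name `se_gate`, the random weights, and the sizes below are hypothetical choices for illustration; this page does not show the authors' implementation, and the sketch is not a stand-in for it.

```python
import math
import random


def se_gate(features, w1, w2):
    """Rescale each channel of `features` (channels x time) with an SE gate.

    w1 has shape (reduced x channels) and w2 has shape (channels x reduced).
    Both are illustrative weights, not values taken from the paper.
    """
    # Squeeze: average each channel over the time axis.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expand to one value per channel, squash to (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every time step of a channel by that channel's gate.
    scaled = [[g * x for x in ch] for g, ch in zip(gates, features)]
    return scaled, gates


# Toy input: 4 channels, 8 time steps, bottleneck of 2 (all assumed sizes).
random.seed(0)
channels, reduced, steps = 4, 2, 8
features = [[random.uniform(-1, 1) for _ in range(steps)] for _ in range(channels)]
w1 = [[random.uniform(-1, 1) for _ in range(channels)] for _ in range(reduced)]
w2 = [[random.uniform(-1, 1) for _ in range(reduced)] for _ in range(channels)]

scaled, gates = se_gate(features, w1, w2)
print("per-channel gates:", [round(g, 3) for g in gates])
```

Each gate lies strictly between 0 and 1, so a channel is attenuated rather than amplified. In a trained network the weights would be learned so that channels useful for the target source receive gates closer to 1.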