Journal of South China University of Technology (Natural Science Edition) ›› 2021, Vol. 49 ›› Issue (11): 106-115, 134. doi: 10.12141/j.issn.1000-565X.200593

Special Issue: 2021 Electronics, Communication and Automatic Control

• Electronics, Communication & Automation Technology •

Music Source Separation Method Based on Unet Combining SE and BiSRU

ZHANG Ruifeng, BAI Jintong, GUAN Xin, LI Qiang

  1. School of Microelectronics, Tianjin University, Tianjin 300072, China
  • Received: 2020-09-30 Revised: 2020-12-29 Online: 2021-11-25 Published: 2021-11-01
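  • Contact: GUAN Xin (born 1977), female, Ph.D., associate professor, mainly engaged in research on sound and music computing. E-mail: guanxin@tju.edu.cn
  • About author: ZHANG Ruifeng (born 1974), male, Ph.D., associate professor, mainly engaged in research on machine vision and audio processing. E-mail: zhangruifeng@tju.edu.cn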
  • Supported by:
    the National Natural Science Foundation of China (61471263) and the Natural Science Foundation of Tianjin (16JCZDJC31100)

Abstract: Music source separation is one of the most important research topics in music information retrieval. Traditional music source separation methods suffer from dependence on prior assumptions, limited model complexity, and poor representation ability. Time-domain end-to-end deep learning models address these problems, but they take a long time to train, and their separation performance still needs improvement. Therefore, to further improve the representation ability and computational efficiency of time-domain end-to-end separation, this study proposes an end-to-end network, Unet-SE-BiSRU, based on Demucs, currently the best-performing time-domain separation model. An attention mechanism is introduced into the generalized encoder and decoder layers: a squeeze-and-excitation (SE) block selectively extracts features according to the type of audio to be separated. To counter the exploding or vanishing gradients that may occur during training, group normalization is added after the one-dimensional convolution. The bidirectional long short-term memory network is replaced with a bidirectional simple recurrent unit (BiSRU), which improves the parallelism of training and reduces the number of model parameters. Experimental results show that the improved network raises the signal-to-noise ratio by 0.34 dB, which is, to the best of our knowledge, the best result among time-domain end-to-end methods, and reduces the training time by three-fifths.

Key words: music source separation, Unet, time domain end-to-end separation model, simple recurrent unit, squeeze-and-excitation, group normalization

CLC Number:
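
The two feature-level operations named in the abstract, squeeze-and-excitation gating and group normalization after a one-dimensional convolution, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `squeeze_excite` and `group_norm`, the weight matrices `w1` and `w2`, and the toy input are all hypothetical, bias terms and the learnable affine parameters are omitted, and plain Python lists stand in for the tensors of a deep learning framework. The BiSRU is not shown.

```python
import math


def squeeze_excite(x, w1, w2):
    """Channel gating for a 1-D feature map (illustrative only).

    x  : list of C channels, each a list of T samples
    w1 : hidden x C weight matrix (assumed bottleneck layer, no bias)
    w2 : C x hidden weight matrix (assumed expansion layer, no bias)
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: one summary value per channel (global average over time).
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation: linear -> ReLU -> linear -> sigmoid.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h))))
             for row in w2]
    # Scale: multiply every sample of a channel by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, x)]
    return scaled, gates


def group_norm(x, num_groups, eps=1e-5):
    """Normalize each group of channels to zero mean and unit variance.

    Statistics are computed over all channels and time steps of a group,
    so they do not depend on the batch size.
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in ch] for ch in group])
    return out


# Toy feature map: 4 channels, 3 time steps (made-up numbers).
features = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0],
            [0.0, 0.0, 0.0],
            [2.0, 2.0, 2.0]]

normalized = group_norm(features, num_groups=2)

# Hypothetical weights with a bottleneck of 2 hidden units.
w1 = [[0.5, -0.5, 0.25, 0.0],
      [0.0, 0.25, -0.25, 0.5]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]
gated, gates = squeeze_excite(normalized, w1, w2)
```

Each gate lies strictly between 0 and 1, so a channel is attenuated rather than removed. Because group normalization computes its statistics per sample, it behaves the same for any batch size.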