计算机科学与技术

基于昇腾NPU的快速傅里叶变换算法设计与优化

  • 陆璐 ,
  • 王远飞 ,
  • 梁志宏 ,
  • 索思亮
展开
  • 1.华南理工大学 计算机科学与工程学院,广东 广州 510006
    2.南方电网科学研究院有限责任公司,广东 广州 510663
    3.广东省电力系统网络安全企业重点实验室,广东 广州 510663
陆璐(1971—),男,教授,博士生导师,主要从事软件缺陷预测、高性能计算、深度学习训练推理加速等研究。E-mail: lul@scut.edu.cn
陆璐(1971—),男,教授,博士生导师,主要从事软件缺陷预测、高性能计算、深度学习训练推理加速等研究。E-mail: lul@scut.edu.cn

收稿日期: 2024-10-29

  网络出版日期: 2025-02-25

基金资助

广东省自然科学基金项目(2024A1515010204);南方电网科学研究院有限责任公司项目(1500002024030103XA00063)

Design and Optimization of Fast Fourier Transform Algorithm Based on Ascend NPU

  • LU Lu ,
  • WANG Yuanfei ,
  • LIANG Zhihong ,
  • SUO Siliang
Expand
  • 1.School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,Guangdong,China
    2.Electric Power Research Institute,CSG,Guangzhou 510663,Guangdong,China
    3.Guangdong Provincial Key Laboratory of Power System Network Security,Guangzhou 510663,Guangdong,China
陆璐(1971—),男,教授,博士生导师,主要从事软件缺陷预测、高性能计算、深度学习训练推理加速等研究。E-mail: lul@scut.edu.cn

Received date: 2024-10-29

  Online published: 2025-02-25

Supported by

the Natural Science Foundation of Guangdong Province(2024A1515010204)

摘要

快速傅里叶变换(Fast Fourier Transform,FFT)作为科学计算和信号处理领域的核心算法,已广泛应用于数字信号处理、图像处理、深度学习等领域。随着数据规模的增长和处理需求的提高,在新型硬件平台上优化FFT算法显得尤为重要。该文深入分析了昇腾NPU的架构特点及其对FFT算法优化的影响,基于矩阵运算形式的Stockham FFT算法,提出了一系列创新性优化策略:设计了启发式radix选择算法,针对不同输入规模提供较优的radix序列组合;针对单次迭代FFT,开发了无需虚实分离的高效计算流程,显著减少了全局内存访问开销;提出了基于片上缓存的数据读取优化策略,大幅提升了数据访问速度;为多次迭代设计了数据布局优化方法,有效改善了整体访存效率。在搭载昇腾910 AI处理器的昇腾Atlas 800平台上的实验结果表明,该文提出的优化策略相比无优化基准实现了4.61的平均加速比;对各项优化策略进行的独立性能分析和验证表明,各单项优化策略的平均加速比为1.42~3.52。研究结果为在新型NPU架构上实现高效FFT算法提供了技术参考。

本文引用格式

陆璐 , 王远飞 , 梁志宏 , 索思亮 . 基于昇腾NPU的快速傅里叶变换算法设计与优化[J]. 华南理工大学学报(自然科学版), 2025 , 53(11) : 9 -17 . DOI: 10.12141/j.issn.1000-565X.240524

Abstract

As a fundamental algorithm in scientific computing and signal processing, fast Fourier transform (FFT) has been widely applied to such fields as digital signal processing, image processing, deep learning. With the growth of data scale and the increasing demand for processing power, optimizing FFT algorithms on emerging hardware platforms has become particularly crucial. This paper conducts an in-depth analysis of the architectural cha-racteristics of Ascend NPU and their impacts on FFT algorithm optimization. Based on the matrix-computation-based Stockham FFT algorithm, a series of innovative optimization strategies are proposed: (1) A heuristic radix selection algorithm is designed to provide effective radix sequence combinations for different input sizes; (2) An efficient computation flow for single-iteration FFT without real-imaginary separation is developed, significantly reducing the global memory access overhead; (3) An on-chip cache-based data reading optimization strategy is proposed, greatly improving data access speed; (4) A data layout optimization method for multiple iterations is designed, effectively enhancing overall memory access efficiency. Experimental results on Ascend Atlas 800 platform equipped with Ascend 910 AI processor demonstrate that the proposed optimization strategies achieve an average speedup of 4.61 compared to non-optimized implementations. Independent performance analysis and validation of each optimization strategy demonstrate that the individual average speedup ratio ranges from 1.42 to 3.52. This research provides a technical references for implementing efficient FFT algorithms on emerging NPU architectures.

参考文献

[1] COOLEY J W, TUKEY J W .An algorithm for the machine calculation of complex Fourier series[J].Ma-thematics of Computation196519(90):297-301.
[2] OPPENHEIM A V .Discrete-time signal processing[M].[S. l.]:Pearson Education India,1999
[3] J?HNE B .Digital image processing[M].Berlin:SpringerScience & Business Media,2005
[4] DUHAMEL P, VETTERLI M .Fast Fourier transforms: a tutorial review and a state of the art[J].Signal Processing199019(4):259-299.
[5] ZHANG Y, LI X .Fast convolutional neural networks with fine-grained FFTs[C]∥Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques.New York:Association for Computing Machinery,2020:255-265.
[6] TOLMACHEV D .VkFFT—a performant,cross-platform and open-source GPU FFT library[J].IEEE Access202311:12039-12058.
[7] HU Y, LU L, LI C .Memory-accelerated parallel method for multidimensional fast Fourier implementation on GPU[J].The Journal of Supercomputing202278(16):18189-18208.
[8] LEE J, KIM D .Large-scale 3D fast Fourier transform computation on a GPU[J].ETRI Journal202345(6):1035-1045.
[9] ZHAO Y, LIU F, MA W,et al .MFFT: a GPU accelerated highly efficient mixed-precision large-scale FFT framework[J].ACM Transactions on Architecture and Code Optimization202320(3):1-23.
[10] MARKIDIS S, DER CHIEN S W, LAURE E,et al .NVIDIA tensor core programmability,performance & precision[C]∥Proceedings of 2018 IEEE International Parallel and Distributed Processing Symposium Workshops.Vancouver:IEEE,2018:522-531.
[11] SORNA A, CHENG X, D’ AZEVEDO E,et al .Optimizing the fast Fourier transform using mixed precision on tensor core hardware[C]∥Proceedings of 2018 IEEE 25th International Conference on High Performance Computing Workshops.Bengaluru:IEEE,2018:3-7.
[12] CHENG X, SORNA A, D’ AZEVEDO E,et al .Accelerating 2D FFT:exploit GPU tensor cores through mixed-precision[C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage,and Analysis.Dallas:ACM,2018
[13] DURRANI S, CHUGHTAI M S, HIDAYETOGLU M,et al .Accelerating Fourier and number theoretic transforms using tensor cores and warp shuffles[C]∥Proceedings of 2021 30th International Conference on Pa-rallel Architectures and Compilation Techniques.Atlanta:IEEE,2021:345-355.
[14] PISHA L, LIGOWSKI ? .Accelerating non-power-of-2 size Fourier transforms with GPU tensor cores[C]∥Proceedings of 2021 IEEE International Parallel and Distributed Processing Symposium.Portland:IEEE,2021:507-516.
[15] LI B, CHENG S, LIN J .TcFFT:a fast half-precision FFT library for NVIDIA tensor cores[C]∥Proceedings of 2021 IEEE International Conference on Cluster Computing.Portland:IEEE,2021:1-11.
[16] JOUPPI N P, YOUNG C, PATIL N,et al .In-datacenter performance analysis of a tensor processing unit[C]∥Proceedings of the 44th Annual International Symposium on Computer Architecture.New York:Association for Computing Machinery,2017:1-12.
[17] ABADI M, BARHAM P, CHEN J,et al .TensorFlow:a system for large-scale machine learning[C]∥Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation.Savannah:USENIX Association,2016:265-283.
[18] LU T, CHEN Y F, HECHTMAN B,et al .Large-scale discrete Fourier transform on TPUs[J].IEEE Access20219:93422-93432.
[19] LU T, MARIN T, ZHUO Y,et al .Nonuniform fast Fourier transform on TPUs[C]∥Proceedings of 2021 IEEE 18th International Symposium on Biomedical Imaging.Nice:IEEE,2021:783-787.
[20] LI Q, ZUO D, FENG Y,et al .Research on high-performance Fourier transform algorithms based on the NPU[J].Applied Sciences202414(1):405/1-19.
文章导航

/