Computer Science & Technology

Design and Optimization of Fast Fourier Transform Algorithm Based on Ascend NPU

  • LU Lu ,
  • WANG Yuanfei ,
  • LIANG Zhihong ,
  • SUO Siliang
Expand
  • 1.School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,Guangdong,China
    2.Electric Power Research Institute,CSG,Guangzhou 510663,Guangdong,China
    3.Guangdong Provincial Key Laboratory of Power System Network Security,Guangzhou 510663,Guangdong,China
陆璐(1971—),男,教授,博士生导师,主要从事软件缺陷预测、高性能计算、深度学习训练推理加速等研究。E-mail: lul@scut.edu.cn

Received date: 2024-10-29

  Online published: 2025-02-25

Supported by

the Natural Science Foundation of Guangdong Province(2024A1515010204)

Abstract

As a fundamental algorithm in scientific computing and signal processing, fast Fourier transform (FFT) has been widely applied to such fields as digital signal processing, image processing, deep learning. With the growth of data scale and the increasing demand for processing power, optimizing FFT algorithms on emerging hardware platforms has become particularly crucial. This paper conducts an in-depth analysis of the architectural cha-racteristics of Ascend NPU and their impacts on FFT algorithm optimization. Based on the matrix-computation-based Stockham FFT algorithm, a series of innovative optimization strategies are proposed: (1) A heuristic radix selection algorithm is designed to provide effective radix sequence combinations for different input sizes; (2) An efficient computation flow for single-iteration FFT without real-imaginary separation is developed, significantly reducing the global memory access overhead; (3) An on-chip cache-based data reading optimization strategy is proposed, greatly improving data access speed; (4) A data layout optimization method for multiple iterations is designed, effectively enhancing overall memory access efficiency. Experimental results on Ascend Atlas 800 platform equipped with Ascend 910 AI processor demonstrate that the proposed optimization strategies achieve an average speedup of 4.61 compared to non-optimized implementations. Independent performance analysis and validation of each optimization strategy demonstrate that the individual average speedup ratio ranges from 1.42 to 3.52. This research provides a technical references for implementing efficient FFT algorithms on emerging NPU architectures.

Cite this article

LU Lu , WANG Yuanfei , LIANG Zhihong , SUO Siliang . Design and Optimization of Fast Fourier Transform Algorithm Based on Ascend NPU[J]. Journal of South China University of Technology(Natural Science), 2025 , 53(11) : 9 -17 . DOI: 10.12141/j.issn.1000-565X.240524

References

[1] COOLEY J W, TUKEY J W .An algorithm for the machine calculation of complex Fourier series[J].Ma-thematics of Computation196519(90):297-301.
[2] OPPENHEIM A V .Discrete-time signal processing[M].[S. l.]:Pearson Education India,1999
[3] J?HNE B .Digital image processing[M].Berlin:SpringerScience & Business Media,2005
[4] DUHAMEL P, VETTERLI M .Fast Fourier transforms: a tutorial review and a state of the art[J].Signal Processing199019(4):259-299.
[5] ZHANG Y, LI X .Fast convolutional neural networks with fine-grained FFTs[C]∥Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques.New York:Association for Computing Machinery,2020:255-265.
[6] TOLMACHEV D .VkFFT—a performant,cross-platform and open-source GPU FFT library[J].IEEE Access202311:12039-12058.
[7] HU Y, LU L, LI C .Memory-accelerated parallel method for multidimensional fast Fourier implementation on GPU[J].The Journal of Supercomputing202278(16):18189-18208.
[8] LEE J, KIM D .Large-scale 3D fast Fourier transform computation on a GPU[J].ETRI Journal202345(6):1035-1045.
[9] ZHAO Y, LIU F, MA W,et al .MFFT: a GPU accelerated highly efficient mixed-precision large-scale FFT framework[J].ACM Transactions on Architecture and Code Optimization202320(3):1-23.
[10] MARKIDIS S, DER CHIEN S W, LAURE E,et al .NVIDIA tensor core programmability,performance & precision[C]∥Proceedings of 2018 IEEE International Parallel and Distributed Processing Symposium Workshops.Vancouver:IEEE,2018:522-531.
[11] SORNA A, CHENG X, D’ AZEVEDO E,et al .Optimizing the fast Fourier transform using mixed precision on tensor core hardware[C]∥Proceedings of 2018 IEEE 25th International Conference on High Performance Computing Workshops.Bengaluru:IEEE,2018:3-7.
[12] CHENG X, SORNA A, D’ AZEVEDO E,et al .Accelerating 2D FFT:exploit GPU tensor cores through mixed-precision[C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage,and Analysis.Dallas:ACM,2018
[13] DURRANI S, CHUGHTAI M S, HIDAYETOGLU M,et al .Accelerating Fourier and number theoretic transforms using tensor cores and warp shuffles[C]∥Proceedings of 2021 30th International Conference on Pa-rallel Architectures and Compilation Techniques.Atlanta:IEEE,2021:345-355.
[14] PISHA L, LIGOWSKI ? .Accelerating non-power-of-2 size Fourier transforms with GPU tensor cores[C]∥Proceedings of 2021 IEEE International Parallel and Distributed Processing Symposium.Portland:IEEE,2021:507-516.
[15] LI B, CHENG S, LIN J .TcFFT:a fast half-precision FFT library for NVIDIA tensor cores[C]∥Proceedings of 2021 IEEE International Conference on Cluster Computing.Portland:IEEE,2021:1-11.
[16] JOUPPI N P, YOUNG C, PATIL N,et al .In-datacenter performance analysis of a tensor processing unit[C]∥Proceedings of the 44th Annual International Symposium on Computer Architecture.New York:Association for Computing Machinery,2017:1-12.
[17] ABADI M, BARHAM P, CHEN J,et al .TensorFlow:a system for large-scale machine learning[C]∥Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation.Savannah:USENIX Association,2016:265-283.
[18] LU T, CHEN Y F, HECHTMAN B,et al .Large-scale discrete Fourier transform on TPUs[J].IEEE Access20219:93422-93432.
[19] LU T, MARIN T, ZHUO Y,et al .Nonuniform fast Fourier transform on TPUs[C]∥Proceedings of 2021 IEEE 18th International Symposium on Biomedical Imaging.Nice:IEEE,2021:783-787.
[20] LI Q, ZUO D, FENG Y,et al .Research on high-performance Fourier transform algorithms based on the NPU[J].Applied Sciences202414(1):405/1-19.
Outlines

/