Journal of South China University of Technology(Natural Science) >
Design and Optimization of Fast Fourier Transform Algorithm Based on Ascend NPU
Received date: 2024-10-29
Online published: 2025-02-25
Supported by
the Natural Science Foundation of Guangdong Province(2024A1515010204)
As a fundamental algorithm in scientific computing and signal processing, fast Fourier transform (FFT) has been widely applied to such fields as digital signal processing, image processing, deep learning. With the growth of data scale and the increasing demand for processing power, optimizing FFT algorithms on emerging hardware platforms has become particularly crucial. This paper conducts an in-depth analysis of the architectural cha-racteristics of Ascend NPU and their impacts on FFT algorithm optimization. Based on the matrix-computation-based Stockham FFT algorithm, a series of innovative optimization strategies are proposed: (1) A heuristic radix selection algorithm is designed to provide effective radix sequence combinations for different input sizes; (2) An efficient computation flow for single-iteration FFT without real-imaginary separation is developed, significantly reducing the global memory access overhead; (3) An on-chip cache-based data reading optimization strategy is proposed, greatly improving data access speed; (4) A data layout optimization method for multiple iterations is designed, effectively enhancing overall memory access efficiency. Experimental results on Ascend Atlas 800 platform equipped with Ascend 910 AI processor demonstrate that the proposed optimization strategies achieve an average speedup of 4.61 compared to non-optimized implementations. Independent performance analysis and validation of each optimization strategy demonstrate that the individual average speedup ratio ranges from 1.42 to 3.52. This research provides a technical references for implementing efficient FFT algorithms on emerging NPU architectures.
LU Lu , WANG Yuanfei , LIANG Zhihong , SUO Siliang . Design and Optimization of Fast Fourier Transform Algorithm Based on Ascend NPU[J]. Journal of South China University of Technology(Natural Science), 2025 , 53(11) : 9 -17 . DOI: 10.12141/j.issn.1000-565X.240524
| [1] | COOLEY J W, TUKEY J W .An algorithm for the machine calculation of complex Fourier series[J].Ma-thematics of Computation,1965,19(90):297-301. |
| [2] | OPPENHEIM A V .Discrete-time signal processing[M].[S. l.]:Pearson Education India,1999. |
| [3] | J?HNE B .Digital image processing[M].Berlin:SpringerScience & Business Media,2005. |
| [4] | DUHAMEL P, VETTERLI M .Fast Fourier transforms: a tutorial review and a state of the art[J].Signal Processing,1990,19(4):259-299. |
| [5] | ZHANG Y, LI X .Fast convolutional neural networks with fine-grained FFTs[C]∥Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques.New York:Association for Computing Machinery,2020:255-265. |
| [6] | TOLMACHEV D .VkFFT—a performant,cross-platform and open-source GPU FFT library[J].IEEE Access,2023,11:12039-12058. |
| [7] | HU Y, LU L, LI C .Memory-accelerated parallel method for multidimensional fast Fourier implementation on GPU[J].The Journal of Supercomputing,2022,78(16):18189-18208. |
| [8] | LEE J, KIM D .Large-scale 3D fast Fourier transform computation on a GPU[J].ETRI Journal,2023,45(6):1035-1045. |
| [9] | ZHAO Y, LIU F, MA W,et al .MFFT: a GPU accelerated highly efficient mixed-precision large-scale FFT framework[J].ACM Transactions on Architecture and Code Optimization,2023,20(3):1-23. |
| [10] | MARKIDIS S, DER CHIEN S W, LAURE E,et al .NVIDIA tensor core programmability,performance & precision[C]∥Proceedings of 2018 IEEE International Parallel and Distributed Processing Symposium Workshops.Vancouver:IEEE,2018:522-531. |
| [11] | SORNA A, CHENG X, D’ AZEVEDO E,et al .Optimizing the fast Fourier transform using mixed precision on tensor core hardware[C]∥Proceedings of 2018 IEEE 25th International Conference on High Performance Computing Workshops.Bengaluru:IEEE,2018:3-7. |
| [12] | CHENG X, SORNA A, D’ AZEVEDO E,et al .Accelerating 2D FFT:exploit GPU tensor cores through mixed-precision[C]∥Proceedings of the International Conference for High Performance Computing,Networking,Storage,and Analysis.Dallas:ACM,2018. |
| [13] | DURRANI S, CHUGHTAI M S, HIDAYETOGLU M,et al .Accelerating Fourier and number theoretic transforms using tensor cores and warp shuffles[C]∥Proceedings of 2021 30th International Conference on Pa-rallel Architectures and Compilation Techniques.Atlanta:IEEE,2021:345-355. |
| [14] | PISHA L, LIGOWSKI ? .Accelerating non-power-of-2 size Fourier transforms with GPU tensor cores[C]∥Proceedings of 2021 IEEE International Parallel and Distributed Processing Symposium.Portland:IEEE,2021:507-516. |
| [15] | LI B, CHENG S, LIN J .TcFFT:a fast half-precision FFT library for NVIDIA tensor cores[C]∥Proceedings of 2021 IEEE International Conference on Cluster Computing.Portland:IEEE,2021:1-11. |
| [16] | JOUPPI N P, YOUNG C, PATIL N,et al .In-datacenter performance analysis of a tensor processing unit[C]∥Proceedings of the 44th Annual International Symposium on Computer Architecture.New York:Association for Computing Machinery,2017:1-12. |
| [17] | ABADI M, BARHAM P, CHEN J,et al .TensorFlow:a system for large-scale machine learning[C]∥Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation.Savannah:USENIX Association,2016:265-283. |
| [18] | LU T, CHEN Y F, HECHTMAN B,et al .Large-scale discrete Fourier transform on TPUs[J].IEEE Access,2021,9:93422-93432. |
| [19] | LU T, MARIN T, ZHUO Y,et al .Nonuniform fast Fourier transform on TPUs[C]∥Proceedings of 2021 IEEE 18th International Symposium on Biomedical Imaging.Nice:IEEE,2021:783-787. |
| [20] | LI Q, ZUO D, FENG Y,et al .Research on high-performance Fourier transform algorithms based on the NPU[J].Applied Sciences,2024,14(1):405/1-19. |
/
| 〈 |
|
〉 |