Journal of South China University of Technology(Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (11): 9-17.doi: 10.12141/j.issn.1000-565X.240524

• Computer Science & Technology • Previous Articles     Next Articles

Design and Optimization of Fast Fourier Transform Algorithm Based on Ascend NPU

LU Lu1, WANG Yuanfei1, LIANG Zhihong2,3, SUO Siliang2,3   

  1. 1.School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,Guangdong,China
    2.Electric Power Research Institute,CSG,Guangzhou 510663,Guangdong,China
    3.Guangdong Provincial Key Laboratory of Power System Network Security,Guangzhou 510663,Guangdong,China
  • Received:2024-10-29 Online:2025-11-25 Published:2025-02-28
  • About author:陆璐(1971—),男,教授,博士生导师,主要从事软件缺陷预测、高性能计算、深度学习训练推理加速等研究。E-mail: lul@scut.edu.cn
  • Supported by:
    the Natural Science Foundation of Guangdong Province(2024A1515010204)

Abstract:

As a fundamental algorithm in scientific computing and signal processing, fast Fourier transform (FFT) has been widely applied to such fields as digital signal processing, image processing, deep learning. With the growth of data scale and the increasing demand for processing power, optimizing FFT algorithms on emerging hardware platforms has become particularly crucial. This paper conducts an in-depth analysis of the architectural cha-racteristics of Ascend NPU and their impacts on FFT algorithm optimization. Based on the matrix-computation-based Stockham FFT algorithm, a series of innovative optimization strategies are proposed: (1) A heuristic radix selection algorithm is designed to provide effective radix sequence combinations for different input sizes; (2) An efficient computation flow for single-iteration FFT without real-imaginary separation is developed, significantly reducing the global memory access overhead; (3) An on-chip cache-based data reading optimization strategy is proposed, greatly improving data access speed; (4) A data layout optimization method for multiple iterations is designed, effectively enhancing overall memory access efficiency. Experimental results on Ascend Atlas 800 platform equipped with Ascend 910 AI processor demonstrate that the proposed optimization strategies achieve an average speedup of 4.61 compared to non-optimized implementations. Independent performance analysis and validation of each optimization strategy demonstrate that the individual average speedup ratio ranges from 1.42 to 3.52. This research provides a technical references for implementing efficient FFT algorithms on emerging NPU architectures.

Key words: fast Fourier transform, Ascend NPU, heterogeneous computing, high-performance computing

CLC Number: