华南理工大学学报(自然科学版) ›› 2025, Vol. 53 ›› Issue (11): 1-.doi: 10.12141/j.issn.1000-565X.240524

• 计算机科学与技术 •    

基于昇腾NPU的快速傅里叶变换设计与优化

陆璐1 王远飞梁志宏2 索思亮2   

  1. 1.华南理工大学 计算机科学与工程学院,广东 广州 510006;

    2.南方电网科学研究院有限责任公司/广东省电力系统网络安全企业重点实验室, 广东 广州 510663

  • 出版日期:2025-11-25 发布日期:2025-02-28

Design and Optimization of Fast Fourier Transform Based on Ascend NPU

LU Lu1 WANG Yuanfei1 LIANG Zhihong2 SUO Siliang2   

  1. 1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China;

    2. Electric Power Research Institute, CSG/Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510663, Guangdong, China

  • Online:2025-11-25 Published:2025-02-28

摘要:

快速傅里叶变换(FFT)作为科学计算和信号处理领域的核心算法,已广泛应用于数字信号处理、图像处理、深度学习等多个领域。随着数据规模的增长和处理需求的提高,在新型硬件平台上优化FFT算法变得尤为重要。本研究深入分析了昇腾NPU架构特点及其对FFT算法优化的影响,基于矩阵运算形式的Stockham FFT算法,提出了一系列创新性优化策略:(1)设计了启发式radix选择算法,为不同输入规模提供较优的radix序列组合;(2)针对单次迭代FFT,开发了无需虚实分离的高效计算流程,显著减少了全局内存访问开销;(3)提出了基于片上缓存的数据读取优化策略,大幅提升了数据访问速度;(4)为多次迭代设计了数据布局优化方法,有效改善了整体访存效率。在搭载昇腾910 AI处理器的华为昇腾Atlas 800平台上的实验结果表明:与非优化实现相比,本文提出的优化策略使FFT计算速度平均提升4.61倍。通过对各项优化策略进行独立性能分析和验证,结果显示单项优化可实现1.42倍到3.52倍的显著加速,为在新型NPU架构上实现高效FFT算法提供了重要的技术参考。

关键词: 快速傅里叶变换, 昇腾NPU, 异构计算, 高性能计算

Abstract:

Fast Fourier Transform (FFT), as a fundamental algorithm in scientific computing and signal processing, has been widely applied in digital signal processing, image processing, deep learning, and various other fields. With the growing data scale and increasing processing demands, optimizing FFT algorithms on emerging hardware platforms has become increasingly crucial. This research first conducts an in-depth analysis of Ascend NPU architectural characteristics and their implications for FFT optimization. Based on the matrix-computation-based Stockham FFT algorithm, a series of innovative optimization strategies are proposed: (1) develops a heuristic radix selection algorithm to provide effective radix sequence combinations for different input sizes; (2) designs an efficient computation flow for single-iteration FFT that eliminates the need for real-imaginary separation, significantly reducing global memory access overhead; (3) presents an on-chip cache-based data reading optimization strategy, substantially improving data access speed; (4) implements a data layout optimization method for multiple iterations, effectively enhancing overall memory access efficiency. Experimental results on the Huawei Ascend Atlas 800 platform equipped with Ascend 910 AI processor demonstrate that the proposed optimization strategies achieve an average 4.61-fold speedup compared to non-optimized implementations. Through independent performance analysis and validation of each optimization strategy, results show significant individual speedups ranging from 1.42× to 3.52×, providing valuable technical references for implementing efficient FFT algorithms on emerging NPU architectures.

Key words: fast Fourier transform, ascend NPU, heterogeneous computing, high-performance computing