华南理工大学学报(自然科学版) ›› 2025, Vol. 53 ›› Issue (11): 9-17.doi: 10.12141/j.issn.1000-565X.240524

• 计算机科学与技术 • 上一篇    下一篇

基于昇腾NPU的快速傅里叶变换算法设计与优化

陆璐1, 王远飞1, 梁志宏2,3, 索思亮2,3   

  1. 1.华南理工大学 计算机科学与工程学院,广东 广州 510006
    2.南方电网科学研究院有限责任公司,广东 广州 510663
    3.广东省电力系统网络安全企业重点实验室,广东 广州 510663
  • 收稿日期:2024-10-29 出版日期:2025-11-25 发布日期:2025-02-28
  • 作者简介:陆璐(1971—),男,教授,博士生导师,主要从事软件缺陷预测、高性能计算、深度学习训练推理加速等研究。E-mail: lul@scut.edu.cn
  • 基金资助:
    广东省自然科学基金项目(2024A1515010204);南方电网科学研究院有限责任公司项目(1500002024030103XA00063)

Design and Optimization of Fast Fourier Transform Algorithm Based on Ascend NPU

LU Lu1, WANG Yuanfei1, LIANG Zhihong2,3, SUO Siliang2,3   

  1. 1.School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,Guangdong,China
    2.Electric Power Research Institute,CSG,Guangzhou 510663,Guangdong,China
    3.Guangdong Provincial Key Laboratory of Power System Network Security,Guangzhou 510663,Guangdong,China
  • Received:2024-10-29 Online:2025-11-25 Published:2025-02-28
  • About author:陆璐(1971—),男,教授,博士生导师,主要从事软件缺陷预测、高性能计算、深度学习训练推理加速等研究。E-mail: lul@scut.edu.cn
  • Supported by:
    the Natural Science Foundation of Guangdong Province(2024A1515010204)

摘要:

快速傅里叶变换(Fast Fourier Transform,FFT)作为科学计算和信号处理领域的核心算法,已广泛应用于数字信号处理、图像处理、深度学习等领域。随着数据规模的增长和处理需求的提高,在新型硬件平台上优化FFT算法显得尤为重要。该文深入分析了昇腾NPU的架构特点及其对FFT算法优化的影响,基于矩阵运算形式的Stockham FFT算法,提出了一系列创新性优化策略:设计了启发式radix选择算法,针对不同输入规模提供较优的radix序列组合;针对单次迭代FFT,开发了无需虚实分离的高效计算流程,显著减少了全局内存访问开销;提出了基于片上缓存的数据读取优化策略,大幅提升了数据访问速度;为多次迭代设计了数据布局优化方法,有效改善了整体访存效率。在搭载昇腾910 AI处理器的昇腾Atlas 800平台上的实验结果表明,该文提出的优化策略相比无优化基准实现了4.61的平均加速比;对各项优化策略进行的独立性能分析和验证表明,各单项优化策略的平均加速比为1.42~3.52。研究结果为在新型NPU架构上实现高效FFT算法提供了技术参考。

关键词: 快速傅里叶变换, 昇腾NPU, 异构计算, 高性能计算

Abstract:

As a fundamental algorithm in scientific computing and signal processing, fast Fourier transform (FFT) has been widely applied to such fields as digital signal processing, image processing, deep learning. With the growth of data scale and the increasing demand for processing power, optimizing FFT algorithms on emerging hardware platforms has become particularly crucial. This paper conducts an in-depth analysis of the architectural cha-racteristics of Ascend NPU and their impacts on FFT algorithm optimization. Based on the matrix-computation-based Stockham FFT algorithm, a series of innovative optimization strategies are proposed: (1) A heuristic radix selection algorithm is designed to provide effective radix sequence combinations for different input sizes; (2) An efficient computation flow for single-iteration FFT without real-imaginary separation is developed, significantly reducing the global memory access overhead; (3) An on-chip cache-based data reading optimization strategy is proposed, greatly improving data access speed; (4) A data layout optimization method for multiple iterations is designed, effectively enhancing overall memory access efficiency. Experimental results on Ascend Atlas 800 platform equipped with Ascend 910 AI processor demonstrate that the proposed optimization strategies achieve an average speedup of 4.61 compared to non-optimized implementations. Independent performance analysis and validation of each optimization strategy demonstrate that the individual average speedup ratio ranges from 1.42 to 3.52. This research provides a technical references for implementing efficient FFT algorithms on emerging NPU architectures.

Key words: fast Fourier transform, Ascend NPU, heterogeneous computing, high-performance computing

中图分类号: