Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (3): 20-30. doi: 10.12141/j.issn.1000-565X.240035

• Computer Science & Technology •

Design and Optimization of High-Performance Multi-Dimensional FFT Based on Matrix Core

LU Lu1,2, ZHU Songxiang1, TIAN Qingyan3, LIN Haishan3, GUO Yijie1

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  2. Pengcheng Laboratory, Shenzhen 518000, Guangdong, China
  3. Tunnel Engineering Safety and Emergency Support Technology and Equipment Laboratory of Guangdong Province, Guangzhou 510440, Guangdong, China
  • Received: 2024-01-12 Online: 2025-03-10 Published: 2024-04-23
  • Supported by:
    the Key-Area R&D Program of Guangdong Province (2022B0101070001)

Abstract:

The fast Fourier transform (FFT) is widely used in scientific computing and related fields. To fully exploit the computational power of the GPU and further improve FFT performance, this paper proposes a high-performance multi-dimensional FFT computation scheme that maps the matrix form of the Stockham FFT onto Matrix Core. For computation, the scheme uses Matrix Core to accelerate the matrix multiplications in the FFT and uses compiler intrinsics to issue fine-grained matrix multiply-accumulate operations, which allows Matrix Core to support a wider range of FFT sizes. To minimize memory access, the scheme performs element-wise matrix multiplication directly in registers, following the way Matrix Core distributes its data across thread registers. It also mitigates bank conflicts by reordering data in shared memory, adopts a double-buffering strategy to alleviate memory access bottlenecks, and introduces an efficient matrix transposition strategy to accelerate multi-dimensional FFT computation. The scheme is compared with the well-known high-performance FFT libraries rocFFT and VkFFT on the AMD MI250 GPU platform. The results show that it outperforms both rocFFT and VkFFT in average computational performance for 1-dimensional, 2-dimensional, and 3-dimensional FFTs. For 3D FFT, it is on average 1.5 times faster than rocFFT and 2.0 times faster than VkFFT.
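
To make the idea of a matrix-form FFT concrete, the following is a minimal CPU sketch in plain Python, not the paper's GPU kernel. It assumes the generic decomposition N = N1 × N2, in which the transform becomes a matrix multiplication by a small DFT matrix, an element-wise multiplication by twiddle factors, and a second matrix multiplication. The function names (`dft_matrix`, `matmul`, `fft_matrix_form`, `naive_dft`) are illustrative and do not come from the paper; on a GPU, the two matrix products are the kind of operation that matrix units such as Matrix Core accelerate.

```python
import cmath


def dft_matrix(n):
    """Return the n x n DFT matrix F with F[j][k] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def fft_matrix_form(x, n1, n2):
    """Length n1*n2 DFT as: matmul, element-wise twiddle, matmul."""
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1 x n2 matrix in row-major order.
    a = [[x[n2 * r + c] for c in range(n2)] for r in range(n1)]
    # Step 1: left-multiply by the n1-point DFT matrix (transforms columns).
    b = matmul(dft_matrix(n1), a)
    # Step 2: element-wise multiplication by the twiddle factors W_n^(k1*c).
    t = [[b[k1][c] * cmath.exp(-2j * cmath.pi * k1 * c / n)
          for c in range(n2)] for k1 in range(n1)]
    # Step 3: right-multiply by the n2-point DFT matrix (transforms rows).
    d = matmul(t, dft_matrix(n2))
    # Output index is k1 + n1*k2, i.e. the result is read out transposed.
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = d[k1][k2]
    return out


def naive_dft(x):
    """Direct O(n^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]
```

Calling `fft_matrix_form(x, 4, 4)` on a 16-element list should agree with `naive_dft(x)` up to floating-point rounding. The element-wise twiddle step in the middle corresponds to the kind of operation the abstract describes as being performed directly in registers.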

Key words: graphics processing unit, Matrix Core, fast Fourier transform, matrix multiplication
