Journal of South China University of Technology (Natural Science Edition), 2025, Vol. 53, Issue (9): 48-58. doi: 10.12141/j.issn.1000-565X.240498

• Computer Science & Technology •

Design and Optimization of Batch Matrix Multiplication for Small Size Using Half-Precision on Matrix Core

LU Lu1, ZHAO Rong1, LIANG Zhihong2,3, SUO Siliang2,3

1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China;

    2. Electric Power Research Institute, CSG, Guangzhou 510000, Guangdong, China; 

3. Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510000, Guangdong, China

• Online: 2025-09-25  Published: 2025-04-21

Abstract:

General matrix multiplication (GEMM) is one of the most important operations in linear algebra, and many applications across scientific fields have restructured their key computations around it; GEMM is widely used in large models, machine learning, scientific computing, signal processing, and other areas. In particular, half-precision (FP16) batched GEMM has become a core operation of many deep learning frameworks. This paper proposes a GPU optimization scheme for half-precision batched GEMM (HGEMM). For the blocking strategy, the scheme uses a matrix-size-aware blocking that assigns every wavefront the same workload and amount of computation, and lets each thread compute several matrix products simultaneously to raise the utilization of the compute units. For memory access optimization, at the cost of reading some data multiple times, every thread is assigned the same amount of memory traffic, which eases compiler optimization and allows memory access and computation to overlap and hide each other. For extremely small batched HGEMM with matrix sizes below 16, the scheme uses the 4×4×4 Matrix Core instruction and a matching blocking scheme to improve memory access efficiency while reducing wasted computation, and provides an optional shared-memory path to reach the highest performance. The scheme is compared with two rocBLAS operators on the AMD MI210 GPU. The results show that its average performance on the MI210 is 4.14 times that of rocBLAS hgemm batched and 4.96 times that of rocBLAS gemm ex batched; for extremely small matrix sizes in particular, its average performance is 18.60 times that of rocBLAS hgemm batched and 14.02 times that of rocBLAS gemm ex batched.
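To make the batching and blocking idea concrete, the sketch below shows a minimal plain-HIP batched HGEMM kernel for tiny square matrices, in which every workgroup (one wavefront on MI210) is given the same number of matrices and each thread computes one output element of several matrices. This is an illustrative baseline only, not the paper's implementation: the kernel name hgemm_batched_tiny, the template parameter MATS_PER_BLOCK, and the chosen sizes are hypothetical, and the paper's actual kernels additionally map these tiles onto the 4×4×4 Matrix Core (MFMA) instructions and an optional shared-memory path, which are omitted here.

```cpp
// Illustrative sketch (not the paper's kernel): plain HIP batched HGEMM for
// tiny square matrices. One workgroup handles MATS_PER_BLOCK matrices of size
// n x n (n <= 16), so every wavefront receives the same workload, and each
// thread computes one element of several matrices in sequence.
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

template <int MATS_PER_BLOCK>
__global__ void hgemm_batched_tiny(const __half* __restrict__ A,
                                   const __half* __restrict__ B,
                                   __half* __restrict__ C,
                                   int n, int batch)
{
    const int elems = n * n;
    const int tid   = threadIdx.x;
    if (tid >= elems) return;                    // only n*n threads are active
    const int row = tid / n, col = tid % n;

    const int first = blockIdx.x * MATS_PER_BLOCK;
    for (int m = 0; m < MATS_PER_BLOCK; ++m) {   // several matrices per thread
        const int mat = first + m;
        if (mat >= batch) break;
        const __half* a = A + (size_t)mat * elems;
        const __half* b = B + (size_t)mat * elems;
        float acc = 0.0f;                        // accumulate in FP32
        for (int k = 0; k < n; ++k)
            acc += __half2float(a[row * n + k]) * __half2float(b[k * n + col]);
        C[(size_t)mat * elems + tid] = __float2half(acc);
    }
}

int main()
{
    const int n = 8, batch = 4096;               // example sizes (hypothetical)
    const int matsPerBlock = 8;
    size_t bytes = (size_t)batch * n * n * sizeof(__half);
    __half *dA, *dB, *dC;
    hipMalloc(&dA, bytes); hipMalloc(&dB, bytes); hipMalloc(&dC, bytes);

    dim3 grid((batch + matsPerBlock - 1) / matsPerBlock);
    dim3 block(64);                              // one wavefront per workgroup on MI210
    hipLaunchKernelGGL((hgemm_batched_tiny<matsPerBlock>), grid, block, 0, 0,
                       dA, dB, dC, n, batch);
    hipDeviceSynchronize();

    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```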

Key words: GPU, matrix core, matrix multiplication, memory access optimization