Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (9): 48-58. doi: 10.12141/j.issn.1000-565X.240498

• Computer Science & Technology •

Design and Optimization of Small-Batch Matrix Multiplication Based on Matrix Core

LU Lu1, ZHAO Rong1, LIANG Zhihong2, SUO Siliang2   

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
    2. CSG Electric Power Research Institute / Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510623, Guangdong, China
  • Received: 2024-10-09 Online: 2025-09-25 Published: 2025-04-21
  • About author: LU Lu (b. 1971), male, Ph.D., professor; his research interests include software defect prediction, high-performance computing, and acceleration of deep learning training and inference. E-mail: lul@scut.edu.cn
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2024A1515010204)

Abstract:

General Matrix Multiplication (GEMM) is one of the most important operations in linear algebra, serving as the backbone for numerous applications in machine learning, scientific computing, and signal processing. In particular, FP16 batch GEMM has become a core operation in deep learning frameworks owing to its efficiency in training and inference. However, current implementations on AMD GPUs (e.g., CDNA/MI200 architectures with Matrix Cores) suffer from suboptimal memory access and low compute utilization, limiting performance in high-throughput scenarios. This paper therefore proposes a GPU optimization scheme for half-precision batch GEMM (HGEMM). In its blocking strategy, the scheme assigns equal memory-access and computational loads to threads according to the input matrix sizes, while letting each thread compute multiple matrix products to improve arithmetic-unit utilization. For memory-access optimization, it trades redundant data reads for a uniform per-thread access pattern that the compiler can optimize, ensuring that memory transfers overlap with computation. For extremely small-batch HGEMM, with matrix dimensions smaller than 16, the scheme employs a 4 × 4 × 4 Matrix Core instruction and a corresponding tiling scheme to improve memory performance while reducing wasted compute resources, and it provides the option of using shared memory to achieve the highest performance. The scheme is compared with two rocBLAS operators on the AMD GPU MI210 platform. The results show that its average performance on the MI210 is 4.14 times that of rocBLAS HGEMMBatched and 4.96 times that of rocBLAS GEMMExBatched; for extremely small-batch HGEMM, its average performance is 18.60 times that of rocBLAS HGEMMBatched and 14.02 times that of rocBLAS GEMMExBatched.
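For reference, the per-batch contract that batched HGEMM operators compute can be sketched in NumPy. This is an illustrative sketch only: the function name `hgemm_batched` and the float32-accumulate-then-cast behavior (mirroring how Matrix Cores accumulate FP16 products in FP32) are assumptions, not the paper's implementation or the rocBLAS API.

```python
import numpy as np

def hgemm_batched(A, B, C, alpha=1.0, beta=0.0):
    """Reference semantics of half-precision batched GEMM:
        C[i] = alpha * A[i] @ B[i] + beta * C[i]  for each batch entry i.
    A: (batch, M, K), B: (batch, K, N), C: (batch, M, N), all float16.
    Products are accumulated in float32 (as on Matrix Cores), then cast back.
    """
    acc = alpha * np.matmul(A.astype(np.float32), B.astype(np.float32))
    return (acc + beta * C.astype(np.float32)).astype(np.float16)

# An "extremely small-batch" shape: many independent 8x8x8 products,
# i.e., all matrix dimensions below 16 as in the paper's small-batch case.
rng = np.random.default_rng(0)
batch, M, N, K = 64, 8, 8, 8
A = rng.standard_normal((batch, M, K)).astype(np.float16)
B = rng.standard_normal((batch, K, N)).astype(np.float16)
C = np.zeros((batch, M, N), dtype=np.float16)
out = hgemm_batched(A, B, C)
```

At these sizes each matrix product is far smaller than a GPU thread block's natural workload, which is why the paper's scheme has each thread handle several products and matches the tile shape to a 4 × 4 × 4 Matrix Core instruction.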

Key words: graphics processing unit, Matrix Core, matrix multiplication, memory access optimization
