Journal of South China University of Technology (Natural Science Edition), 2025, Vol. 53, Issue (9): 48-58. doi: 10.12141/j.issn.1000-565X.240498

• Computer Science & Technology •

Design and Optimization of Batch Matrix Multiplication for Small Size Using Half-Precision on Matrix Core

LU Lu1, ZHAO Rong1, LIANG Zhihong2,3, SUO Siliang2,3

1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China;

    2. Electric Power Research Institute, CSG, Guangzhou 510000, Guangdong, China; 

3. Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510000, Guangdong, China

• Online: 2025-09-25  Published: 2025-04-21

Abstract:

General matrix multiplication (GEMM) is one of the most important operations in linear algebra, and many applications across scientific fields have restructured their key computations around it; GEMM is widely used in large models, machine learning, scientific computing, signal processing, and other areas. In particular, half-precision (FP16) batched GEMM has become a core operation of many deep learning frameworks. This paper proposes a GPU optimization scheme for half-precision batched GEMM (HGEMM). For the blocking strategy, the scheme uses a matrix-size-aware blocking that assigns every wavefront the same workload and amount of computation, and lets each thread compute several matrix products simultaneously to raise the utilization of the compute units. For memory access optimization, at the cost of reading some data multiple times, every thread is assigned the same amount of memory traffic, which eases compiler optimization and allows memory access and computation to overlap and hide each other. For extremely small batched HGEMM with matrix sizes below 16, the scheme uses the 4×4×4 Matrix Core instruction and a matching blocking scheme to improve memory access efficiency while reducing wasted computation, and provides an optional shared-memory path to reach the highest performance. The scheme is compared with two rocBLAS operators on the AMD MI210 GPU. The results show that its average performance on the MI210 is 4.14 times that of rocBLAS hgemm batched and 4.96 times that of rocBLAS gemm ex batched; for extremely small matrix sizes in particular, its average performance is 18.60 times that of rocBLAS hgemm batched and 14.02 times that of rocBLAS gemm ex batched.
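To make the batching and blocking idea concrete, the sketch below shows a minimal plain-HIP batched HGEMM kernel for tiny square matrices, in which every workgroup (one wavefront on MI210) is given the same number of matrices and each thread computes one output element of several matrices. This is an illustrative baseline only, not the paper's implementation: the kernel name hgemm_batched_tiny, the template parameter MATS_PER_BLOCK, and the chosen sizes are hypothetical, and the paper's actual kernels additionally map these tiles onto the 4×4×4 Matrix Core (MFMA) instructions and an optional shared-memory path, which are omitted here.

```cpp
// Illustrative sketch (not the paper's kernel): plain HIP batched HGEMM for
// tiny square matrices. One workgroup handles MATS_PER_BLOCK matrices of size
// n x n (n <= 16), so every wavefront receives the same workload, and each
// thread computes one element of several matrices in sequence.
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

template <int MATS_PER_BLOCK>
__global__ void hgemm_batched_tiny(const __half* __restrict__ A,
                                   const __half* __restrict__ B,
                                   __half* __restrict__ C,
                                   int n, int batch)
{
    const int elems = n * n;
    const int tid   = threadIdx.x;
    if (tid >= elems) return;                    // only n*n threads are active
    const int row = tid / n, col = tid % n;

    const int first = blockIdx.x * MATS_PER_BLOCK;
    for (int m = 0; m < MATS_PER_BLOCK; ++m) {   // several matrices per thread
        const int mat = first + m;
        if (mat >= batch) break;
        const __half* a = A + (size_t)mat * elems;
        const __half* b = B + (size_t)mat * elems;
        float acc = 0.0f;                        // accumulate in FP32
        for (int k = 0; k < n; ++k)
            acc += __half2float(a[row * n + k]) * __half2float(b[k * n + col]);
        C[(size_t)mat * elems + tid] = __float2half(acc);
    }
}

int main()
{
    const int n = 8, batch = 4096;               // example sizes (hypothetical)
    const int matsPerBlock = 8;
    size_t bytes = (size_t)batch * n * n * sizeof(__half);
    __half *dA, *dB, *dC;
    hipMalloc(&dA, bytes); hipMalloc(&dB, bytes); hipMalloc(&dC, bytes);

    dim3 grid((batch + matsPerBlock - 1) / matsPerBlock);
    dim3 block(64);                              // one wavefront per workgroup on MI210
    hipLaunchKernelGGL((hgemm_batched_tiny<matsPerBlock>), grid, block, 0, 0,
                       dA, dB, dC, n, batch);
    hipDeviceSynchronize();

    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```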

Key words: GPU, matrix core, matrix multiplication, memory access optimization