Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (9): 48-58. doi: 10.12141/j.issn.1000-565X.240498

• Computer Science & Technology •

Design and Optimization of Batch Matrix Multiplication for Small Sizes Using Half-Precision on Matrix Core

LU Lu1, ZHAO Rong1, LIANG Zhihong2,3, SUO Siliang2,3

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China;
  2. Electric Power Research Institute, CSG, Guangzhou 510000, Guangdong, China;
  3. Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510000, Guangdong, China

  • Online: 2025-09-25  Published: 2025-04-21

Abstract:

General matrix multiplication (GEMM) is one of the most important operations in linear algebra, and many applications across scientific fields cast their performance-critical parts as GEMM. GEMM is widely used in large models, machine learning, scientific computing, signal processing, and other areas; in particular, half-precision (FP16) batched GEMM has long been a core operation of many deep learning frameworks. This paper proposes a GPU optimization scheme for half-precision batched GEMM (HGEMM). For the blocking strategy, the scheme uses matrix-size-aware tiling that assigns every wavefront the same amount of work and computation, and each thread computes several matrix products at once to raise the utilization of the compute units. For memory access optimization, at the cost of some redundant reads, every thread is assigned the same volume of memory accesses, which simplifies compiler optimization and lets memory access and computation overlap and hide each other. For extremely small batched HGEMM, with matrix dimensions below 16, the scheme uses the 4×4×4 Matrix Core instruction and a matching blocking scheme, improving memory access performance while reducing wasted computation, and it offers an option to enable or disable shared memory so that the highest performance can be reached. The scheme is compared with two rocBLAS operators on the AMD MI210 GPU. The results show that its average performance is 4.14 times that of rocBLAS hgemm batched and 4.96 times that of rocBLAS gemm ex batched; for extremely small batched HGEMM in particular, the average performance is 18.60 times that of rocBLAS hgemm batched and 14.02 times that of rocBLAS gemm ex batched.

Key words: GPU, Matrix Core, matrix multiplication, memory access optimization
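
To make the "reduced computation waste" claim concrete, the following back-of-the-envelope sketch (our illustration, not taken from the paper) compares how much of one MFMA issue does useful work when a small half-precision product is mapped either to the 16×16×16 instruction or to the 16-block 4×4×4 instruction available on CDNA GPUs such as the MI210. The 8×8×8 problem size is an assumed example, and the calculation ignores memory behaviour and packing overhead.

```cpp
// Back-of-the-envelope utilization comparison for mapping one small HGEMM onto
// AMD CDNA Matrix Core (MFMA) instructions. The 8x8x8 problem size is an assumed
// example; instruction shapes: V_MFMA_F32_16X16X16F16 computes one 16x16x16 block
// per wavefront per issue, V_MFMA_F32_4X4X4F16 computes 16 independent 4x4x4 blocks.
#include <cstdio>

// Multiply-accumulates performed by one MFMA issue: M*N*K per block, times the block count.
long long mfma_macs(int M, int N, int K, int blocks) { return 1LL * M * N * K * blocks; }

// Multiply-accumulates actually required by one m x n x k matrix product.
long long gemm_macs(int m, int n, int k) { return 1LL * m * n * k; }

int main() {
    const int m = 8, n = 8, k = 8;  // example size below 16 in every dimension

    // Path 1: pad the single small product into one 16x16x16 issue.
    long long big = mfma_macs(16, 16, 16, 1);
    printf("16x16x16 path: %.1f%% of the issued MACs are useful\n",
           100.0 * gemm_macs(m, n, k) / big);

    // Path 2: decompose the product into 4x4x4 blocks; (m/4)*(n/4)*(k/4) = 8 blocks,
    // so two such matrices from the batch can fill the 16 blocks of one issue.
    long long small = mfma_macs(4, 4, 4, 16);
    long long blocks_per_matrix = 1LL * (m / 4) * (n / 4) * (k / 4);
    long long matrices_per_issue = 16 / blocks_per_matrix;
    printf("4x4x4 path: %lld blocks per matrix, %lld matrices per issue, "
           "%.1f%% of the issued MACs are useful\n",
           blocks_per_matrix, matrices_per_issue,
           100.0 * matrices_per_issue * gemm_macs(m, n, k) / small);
    return 0;
}
```

For this example the 16×16×16 path keeps only 12.5% of the issued arithmetic useful, whereas packing two 8×8×8 products into the sixteen 4×4×4 blocks of a single issue keeps all of it useful, which is the kind of packing the paper's blocking scheme is designed to arrange.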
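
The evaluation compares against rocBLAS's batched half-precision GEMM routines. As a reference point only, the sketch below shows a minimal host-side program (not from the paper) invoking the rocblas_hgemm_batched baseline for a batch of small FP16 matrices; the 8×8×8 size, the batch count of 1024, and the buffer layout are arbitrary choices for illustration, and error handling and matrix initialization are omitted.

```cpp
// Minimal host-side sketch of the rocBLAS batched half-precision GEMM baseline
// (rocblas_hgemm_batched) that the paper compares against. Sizes and batch count
// are arbitrary example values; error handling and data initialization are omitted.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>   // older ROCm installs expose this as <rocblas.h>
#include <vector>
#include <cstdint>
#include <cstring>
#include <cstdio>

int main() {
    const rocblas_int m = 8, n = 8, k = 8, batch = 1024;

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // One contiguous device buffer per operand, plus the per-matrix pointer
    // arrays that the *_batched interface expects.
    rocblas_half *dA, *dB, *dC;
    hipMalloc(&dA, sizeof(rocblas_half) * m * k * batch);
    hipMalloc(&dB, sizeof(rocblas_half) * k * n * batch);
    hipMalloc(&dC, sizeof(rocblas_half) * m * n * batch);

    std::vector<rocblas_half*> hA(batch), hB(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * m * k;
        hB[i] = dB + (size_t)i * k * n;
        hC[i] = dC + (size_t)i * m * n;
    }
    rocblas_half **dAarr, **dBarr, **dCarr;
    hipMalloc(&dAarr, sizeof(rocblas_half*) * batch);
    hipMalloc(&dBarr, sizeof(rocblas_half*) * batch);
    hipMalloc(&dCarr, sizeof(rocblas_half*) * batch);
    hipMemcpy(dAarr, hA.data(), sizeof(rocblas_half*) * batch, hipMemcpyHostToDevice);
    hipMemcpy(dBarr, hB.data(), sizeof(rocblas_half*) * batch, hipMemcpyHostToDevice);
    hipMemcpy(dCarr, hC.data(), sizeof(rocblas_half*) * batch, hipMemcpyHostToDevice);

    // alpha = 1.0, beta = 0.0 written through their IEEE FP16 bit patterns so the
    // sketch does not depend on how rocblas_half is defined in a given version.
    rocblas_half alpha, beta;
    const uint16_t one_h = 0x3C00, zero_h = 0x0000;
    std::memcpy(&alpha, &one_h, sizeof(one_h));
    std::memcpy(&beta, &zero_h, sizeof(zero_h));

    // C_i = alpha * A_i * B_i + beta * C_i for every matrix i in the batch
    // (column-major, no transposition, leading dimensions equal to the row counts).
    rocblas_status st = rocblas_hgemm_batched(
        handle, rocblas_operation_none, rocblas_operation_none,
        m, n, k,
        &alpha,
        (const rocblas_half* const*)dAarr, m,
        (const rocblas_half* const*)dBarr, k,
        &beta,
        dCarr, m,
        batch);
    std::printf("rocblas_hgemm_batched returned status %d\n", (int)st);

    hipDeviceSynchronize();
    hipFree(dA); hipFree(dB); hipFree(dC);
    hipFree(dAarr); hipFree(dBarr); hipFree(dCarr);
    rocblas_destroy_handle(handle);
    return 0;
}
```

Under these assumptions the file would be compiled with hipcc and linked against rocBLAS (for example, `hipcc baseline.cpp -lrocblas`).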