Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (9): 48-58. DOI: 10.12141/j.issn.1000-565X.240498

• Computer Science and Technology •

Design and Optimization of Small-Size Batched Matrix Multiplication Based on Matrix Core

LU Lu1, ZHAO Rong1, LIANG Zhihong2, SUO Siliang2   

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
    2. CSG Electric Power Research Institute Co., Ltd. / Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510623, Guangdong, China
  • Received: 2024-10-09; Online: 2025-09-25; Published: 2025-04-21
  • About the author: LU Lu (b. 1971), male, PhD, professor; his research interests include software defect prediction, high-performance computing, and acceleration of deep learning training and inference. E-mail: lul@scut.edu.cn
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2024A1515010204) and the Project of CSG Electric Power Research Institute (1500002024030103XA00063)

Design and Optimization of Small-Batch Matrix Multiplication Based on Matrix Core

LU Lu1, ZHAO Rong1, LIANG Zhihong2, SUO Siliang2   

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
    2. CSG Electric Power Research Institute / Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510623, Guangdong, China
  • Received: 2024-10-09; Online: 2025-09-25; Published: 2025-04-21
  • About the author: LU Lu (b. 1971), male, PhD, professor; his research interests include software defect prediction, high-performance computing, and acceleration of deep learning training and inference. E-mail: lul@scut.edu.cn
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2024A1515010204) and the Project of CSG Electric Power Research Institute (1500002024030103XA00063)

Abstract:

General matrix multiplication (GEMM) is one of the most important operations in linear algebra, and many applications from different scientific fields cast their critical parts into GEMM form. GEMM is widely used in large models, machine learning, scientific computing, and signal processing. In particular, half-precision (FP16) batched GEMM has long been a core operation in many deep learning frameworks. Current implementations of half-precision batched GEMM on AMD GPUs make insufficient use of memory bandwidth and compute resources and are in urgent need of optimization. To this end, this paper proposes a GPU optimization scheme for half-precision batched GEMM (HGEMM). In terms of the tiling strategy, threads are assigned equal amounts of memory access and computation according to the input matrix tile size, and each thread computes multiple matrix products to improve the utilization of the compute units. In terms of memory access optimization, at the cost of reading some data redundantly, every thread is given the same amount of memory access so that the compiler can optimize it, ensuring that memory access and computation time hide each other. For batched HGEMM with extremely small matrices (dimensions smaller than 16), the paper uses the 4 × 4 × 4 Matrix Core and a corresponding tiling scheme, improving memory performance while reducing the waste of Matrix Core compute resources, and offers an option of whether to use shared memory to reach the highest performance. On the AMD GPU MI210 platform, the scheme is compared with two rocBLAS operators. The results show that its average performance on the MI210 is 4.14 times that of rocBLAS HGEMMBatched and 4.96 times that of rocBLAS GEMMExBatched; for batched HGEMM with extremely small matrices, its average performance is 18.60 times that of rocBLAS HGEMMBatched and 14.02 times that of rocBLAS GEMMExBatched.
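The 4 × 4 × 4 Matrix Core referred to above is exposed on CDNA GPUs such as the MI210 through the MFMA compiler builtin __builtin_amdgcn_mfma_f32_4x4x4f16 in HIP. The sketch below only illustrates that instruction and is not the kernel proposed in the paper: it assumes a 64-lane wavefront (blockDim.x == 64), a gfx908/gfx90a target, and packed, suitably aligned input arrays. One such instruction computes 16 independent 4 × 4 × 4 FP16 products with FP32 accumulation per wavefront, which suggests mapping several tiny matrices of a batch onto its blocks to avoid wasting Matrix Core resources; the exact lane-to-element mapping follows the CDNA ISA and is not spelled out here.

// Minimal illustrative sketch (not the paper's kernel): one wavefront issues a
// single 4x4x4 FP16 MFMA instruction with FP32 accumulation.  Each of the 64
// lanes supplies 4 half values of A and of B and holds 4 floats of the result.
// Array names, launch configuration, and alignment are assumptions.
#include <hip/hip_runtime.h>

typedef _Float16 half4_t   __attribute__((ext_vector_type(4)));
typedef float    floatx4_t __attribute__((ext_vector_type(4)));

__global__ void mfma_4x4x4_demo(const _Float16* A, const _Float16* B, float* C)
{
    int lane = threadIdx.x;                       // assumes blockDim.x == 64
    half4_t a = *reinterpret_cast<const half4_t*>(A + 4 * lane);
    half4_t b = *reinterpret_cast<const half4_t*>(B + 4 * lane);
    floatx4_t acc = {0.0f, 0.0f, 0.0f, 0.0f};
    // cbsz/abid/blgp modifiers are 0: no operand broadcasting between blocks.
    acc = __builtin_amdgcn_mfma_f32_4x4x4f16(a, b, acc, 0, 0, 0);
    *reinterpret_cast<floatx4_t*>(C + 4 * lane) = acc;
}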

Key words: graphics processing unit, Matrix Core, matrix multiplication, memory access optimization

Abstract:

General Matrix Multiplication (GEMM) is one of the most important operations in linear algebra, serving as the backbone for numerous applications in machine learning, scientific computing, and signal processing. In particular, FP16 batched GEMM has become a core operation in deep learning frameworks due to its efficiency in training and inference. However, current implementations on AMD GPUs (e.g., CDNA/MI200 architectures with Matrix Cores) suffer from suboptimal memory access and low compute utilization, limiting performance in high-throughput scenarios. Therefore, this paper proposes a GPU optimization scheme for half-precision batched GEMM (HGEMM). In terms of the blocking strategy, it allocates equal memory access and computational loads to threads based on the input matrix tile sizes, while having each thread compute multiple matrix multiplications to improve arithmetic unit utilization. For memory access optimization, it trades redundant data reads for a uniform amount of memory access per thread to facilitate compiler optimization, ensuring that memory access and computation time overlap. For batched HGEMM with extremely small matrices (dimensions smaller than 16), the proposed method employs the 4 × 4 × 4 Matrix Core and a corresponding tiling scheme to enhance memory performance while reducing the waste of computational resources, and provides the option of whether to use shared memory to achieve the highest performance. This paper compares the performance of the scheme with two rocBLAS operators on the AMD GPU MI210 platform. The results show that the average performance of the scheme on the MI210 is 4.14 times that of rocBLAS HGEMMBatched and 4.96 times that of rocBLAS GEMMExBatched; for batched HGEMM with extremely small matrices, the average performance is 18.60 times that of rocBLAS HGEMMBatched and 14.02 times that of rocBLAS GEMMExBatched.
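For reference, the rocBLAS baselines named in the comparison are the library's batched half-precision GEMM entry points. The host-side sketch below shows a plain call to rocblas_hgemm_batched, which presumably corresponds to the HGEMMBatched baseline; the function name and argument order follow the public rocBLAS API, while the wrapper function, the device arrays of per-matrix pointers, the leading dimensions, and the FP16 bit patterns for alpha and beta are illustrative assumptions (rocBLAS matrices are column-major; on older ROCm releases the header is <rocblas.h> rather than <rocblas/rocblas.h>).

// Illustrative host-side call to the rocBLAS batched FP16 GEMM baseline.
// d_Aarray/d_Barray/d_Carray are device arrays of per-matrix pointers,
// assumed to be allocated and filled elsewhere; m, n, k, batch_count are
// illustrative parameters.  Status return codes are ignored in this sketch.
#include <cstdint>
#include <cstring>
#include <rocblas/rocblas.h>

void run_hgemm_batched(rocblas_half* const* d_Aarray,
                       rocblas_half* const* d_Barray,
                       rocblas_half* const* d_Carray,
                       rocblas_int m, rocblas_int n, rocblas_int k,
                       rocblas_int batch_count)
{
    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // FP16 bit patterns for alpha = 1.0 and beta = 0.0.
    rocblas_half alpha, beta;
    const uint16_t one = 0x3C00, zero = 0x0000;
    std::memcpy(&alpha, &one, sizeof(alpha));
    std::memcpy(&beta, &zero, sizeof(beta));

    // Column-major, no transposition; leading dimensions equal the row counts.
    rocblas_hgemm_batched(handle,
                          rocblas_operation_none, rocblas_operation_none,
                          m, n, k,
                          &alpha,
                          d_Aarray, m,
                          d_Barray, k,
                          &beta,
                          d_Carray, m,
                          batch_count);

    rocblas_destroy_handle(handle);
}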

Key words: graphics processing unit, Matrix Core, matrix multiplication, memory access optimization

CLC number: