Journal of South China University of Technology(Natural Science Edition)

• Computer Science & Technology •

Optimization of Matrix-Vector Multiplication Based on Ascend NPU

LU Lu, GU Zhongshu

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China

  • Published: 2026-01-23

Abstract:

As a key kernel in the decoding phase of large language models (LLMs), General Matrix-Vector Multiplication (GEMV) directly determines the efficiency of model inference. However, current NPUs default to implementing GEMV by calling the General Matrix Multiplication (GEMM) interface, which adapts poorly to small-dimension scenarios and underutilizes hardware resources. Targeting the Ascend 910B NPU, this study focuses on the hardware adaptation and performance optimization of GEMV. Based on the hardware characteristics of the AI Core in the Ascend 910B, basic GEMV implementation schemes are designed for the Matrix Computing Unit (AIC) and the Vector Computing Unit (AIV) within the AI Core, respectively: to satisfy the data layout requirements of the AIC, GEMV is equivalently transformed into the multiplication of a transposed matrix and a vector, improving the efficiency of result write-back; for the data access characteristics of the AIV, differentiated computing processes are designed for different data layouts to keep memory accesses and compute instructions contiguous. On this basis, general optimization strategies are explored: rectangular tiling ensures contiguous memory access and reduces the number of instruction issues, while secondary task partitioning resolves load imbalance. In addition, an operator fusion interface is provided to fuse GEMV with elementwise operators, with a performance loss of less than 2% after fusion on the AIV. Experimental results show that in single-precision (FP32), half-precision (FP16), and int8 integer scenarios, the GEMV implemented and optimized in this study reaches 3.38 times, 2.94 times, and 23.95 times (including dequantization) the performance of the default GEMM interface of the Ascend CANN platform, respectively. This provides support for efficient model deployment and operation on NPU platforms.
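The arithmetic behind two of the ideas above can be sketched numerically. This is a NumPy model of the math only, not the Ascend CANN/AscendC kernel code; the ReLU epilogue and the per-tensor dequantization scale are hypothetical stand-ins for the fused elementwise operators and int8 path described in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 8                          # small illustrative shapes
A = rng.standard_normal((m, n)).astype(np.float32)
x = rng.standard_normal(n).astype(np.float32)

# 1) Transposed-matrix equivalence: A @ x == x @ A.T.
#    Expressing the GEMV as a vector times the transposed matrix lets
#    the result be produced along a contiguous output dimension,
#    which is the write-back layout the AIC path exploits.
y_ref = A @ x
y_t = x @ A.T
assert np.allclose(y_ref, y_t, atol=1e-5)

# 2) Fusing GEMV with an elementwise epilogue (hypothetical ReLU):
#    the fused operator applies the elementwise step in the same pass
#    instead of launching a separate kernel.
def fused_gemv_relu(A, x):
    return np.maximum(A @ x, 0.0)

# 3) int8 GEMV with dequantization: accumulate in int32, then apply a
#    (hypothetical) per-tensor scale to recover floating-point output.
A_q = rng.integers(-128, 128, size=(m, n), dtype=np.int8)
x_q = rng.integers(-128, 128, size=n, dtype=np.int8)
scale = np.float32(0.02)
y_deq = (A_q.astype(np.int32) @ x_q.astype(np.int32)).astype(np.float32) * scale
```

The transpose trick changes no arithmetic, only the traversal and write-back order, which is why it can be applied as an exact equivalence rather than an approximation.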

Key words: general matrix-vector multiplication, Ascend NPU, operator fusion