Computer Science and Technology


Optimization of Matrix-Vector Multiplication Based on Ascend NPU

  • School of Computer Science and Engineering, South China University of Technology, Guangzhou Guangdong 510006, China

Online published: 2026-01-23



Cite this article

LU Lu, GU Zhongshu. Optimization of Matrix-Vector Multiplication Based on Ascend NPU[J]. Journal of South China University of Technology (Natural Science Edition), 0: 1. DOI: 10.12141/j.issn.1000-565X.250406

Abstract

General Matrix-Vector Multiplication (GEMV) is a core kernel in the decoding phase of large language models (LLMs), and its performance directly determines the efficiency of model inference. However, current NPU platforms default to implementing GEMV by calling the General Matrix Multiplication (GEMM) interface, which adapts poorly to small-dimension scenarios and underutilizes hardware resources. Based on the Ascend 910B NPU, this study focuses on the hardware adaptation and performance optimization of GEMV. Drawing on the hardware characteristics of the AI Core in the Ascend 910B, basic GEMV implementation schemes are designed for its Matrix Computing Unit (AIC) and Vector Computing Unit (AIV) respectively: to meet the hardware data-layout requirements of the AIC, GEMV is equivalently transformed into the multiplication of a transposed matrix and a vector, improving the efficiency of result write-back; to match the memory-access characteristics of the AIV, differentiated computing processes are designed for different data layouts, preserving the continuity of memory accesses and computing instructions. On this basis, general optimization strategies are explored: rectangular tiling ensures contiguous memory access and reduces the number of instruction issues, while secondary task partitioning resolves load imbalance. Additionally, an operator fusion interface is provided to support fusing GEMV with elementwise operators, with a performance loss of less than 2% after fusion on the AIV. Experimental results show that in single-precision (FP32), half-precision (FP16), and int8 integer scenarios, the GEMV implemented and optimized in this study reaches 3.38 times, 2.94 times, and 23.95 times (including dequantization) the performance of the default GEMM interface of the Ascend CANN platform, respectively. This provides support for efficient model deployment and execution on NPU platforms.
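The AIC-side transformation described in the abstract — computing y = A·x as the row-vector product xᵀ·Aᵀ so that the result is produced as a single contiguous row — can be sketched in plain Python. This is an illustrative reference only, not the paper's Ascend C kernel; the function names `gemv` and `gemv_via_transpose` are ours:

```python
def gemv(A, x):
    """Reference GEMV: y[i] = sum_j A[i][j] * x[j] (row-wise dot products)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gemv_via_transpose(A, x):
    """The same product computed as x^T @ A^T.

    Each scalar x[j] scales row j of A^T (i.e., column j of A) and is
    accumulated into the output row, so y is built up as one contiguous
    row — the layout the matrix unit can write back efficiently.
    """
    At = [list(col) for col in zip(*A)]  # A^T: len(x) rows of len(A) entries
    y = [0] * len(A)
    for xj, at_row in zip(x, At):        # axpy-style accumulation per column of A
        for i, a in enumerate(at_row):
            y[i] += xj * a
    return y

A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2 matrix
x = [10, 1]
assert gemv(A, x) == gemv_via_transpose(A, x) == [12, 34, 56]
```

The accumulation form also hints at the vector-unit mapping: each inner loop touches a contiguous run of the output row, which is the access pattern the abstract's rectangular tiling is designed to preserve.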
