Journal of South China University of Technology (Natural Science Edition) ›› 2024, Vol. 52 ›› Issue (2): 13-22. doi: 10.12141/j.issn.1000-565X.230066

• Computer Science & Technology •

Design and Optimization of Single-Node HPL-AI Benchmark for a Heterogeneous Platform Composed of Kunpeng and Ascend

WU Haotian(1,2), REN Changqing(1), LU Lu(1), XU Pengxiang(3), YANG Kai(3)

    1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
    2. Zhengzhou Xinda Institute of Advanced Technology, Zhengzhou 450001, Henan, China
    3. Peng Cheng Laboratory, Shenzhen 518000, Guangdong, China
  • Received: 2023-02-27 Online: 2024-02-25 Published: 2023-05-22
  • About author: WU Haotian (born 1980), male, Ph.D., associate professor, mainly engaged in research on reversible data hiding, privacy computing, image processing, high-performance computing, and blockchain. E-mail: wuht@scut.edu.cn
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2021A1515011798); the Open Foundation of Henan Key Laboratory of Cyberspace Situation Awareness (HNTS2022017)

Abstract:

Because low-precision floating-point operations are faster, a growing number of high-performance applications use mixed-precision schemes for acceleration. Large AI (artificial intelligence) models accelerated in this way have also received wide attention. Recently, the HPL-AI (High Performance LINPACK for Accelerator Introspection) benchmark was proposed to evaluate the mixed-precision computing performance of high-performance systems. This study designs and optimizes a single-node implementation of the HPL-AI benchmark on a heterogeneous platform composed of Kunpeng and Ascend processors. To balance the load across the AI processors, tasks are distributed evenly among them using a cyclic task allocation strategy. A task allocation strategy with an interval value is used to improve the contiguity of data transfers, which reduces the data transmission time between the CPU and the AI processors. The data scaling step is removed, which reduces computation on the CPU side without affecting calculation accuracy. Experimental results show that the HPL-AI benchmark achieves its highest mixed-precision floating-point performance when the interval value is 8, and that removing the data scaling does not affect the accuracy of the benchmark results. Compared with the non-optimized HPL-AI benchmark implementation on the Kunpeng and Ascend heterogeneous platform, the optimization strategies proposed in this paper improve mixed-precision floating-point performance by about 29%, which lays a foundation for further optimization of the single-node HPL-AI benchmark and for deployment of a multi-node HPL-AI benchmark.
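
The abstract does not define the interval value precisely. The following Python sketch illustrates one plausible reading, under the assumption that `interval` consecutive task blocks are assigned to the same AI processor before the allocation cycles to the next one, so that an interval of 1 reduces to plain cyclic allocation. The function names (`assign_blocks`, `blocks_per_device`, `contiguous_runs`) are hypothetical and are not taken from the paper or from any Ascend API.

```python
def assign_blocks(num_blocks, num_devices, interval=1):
    """Return a list mapping each block index to a device index.

    Assumed interpretation (not confirmed by the paper): `interval`
    consecutive blocks go to one device, then the next device takes
    the following `interval` blocks, cycling over all devices.
    """
    if num_blocks < 0 or num_devices < 1 or interval < 1:
        raise ValueError("num_blocks >= 0, num_devices >= 1, interval >= 1 required")
    return [(block // interval) % num_devices for block in range(num_blocks)]


def blocks_per_device(assignment, num_devices):
    """Group block indices by the device they were assigned to."""
    groups = [[] for _ in range(num_devices)]
    for block, device in enumerate(assignment):
        groups[device].append(block)
    return groups


def contiguous_runs(blocks):
    """Count maximal runs of consecutive block indices.

    Each run stands in for one contiguous host-to-device transfer,
    so fewer runs means more contiguous data movement.
    """
    runs = 0
    previous = None
    for block in blocks:
        if previous is None or block != previous + 1:
            runs += 1
        previous = block
    return runs


if __name__ == "__main__":
    for interval in (1, 8):
        groups = blocks_per_device(assign_blocks(32, 2, interval), 2)
        print(interval, [len(g) for g in groups], [contiguous_runs(g) for g in groups])
```

For 32 blocks on 2 devices, both settings give each device 16 blocks, but this sketch yields 16 separate runs per device at interval 1 and only 2 runs per device at interval 8. It models only the layout of the assignment; it says nothing about why a particular interval value performs best on real hardware.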

Key words: Kunpeng, Ascend, heterogeneous platform, benchmark test, high performance computing, mixed precision

CLC Number: