Journal of South China University of Technology (Natural Science Edition) ›› 2019, Vol. 47 ›› Issue (6): 51-56. doi: 10.12141/j.issn.1000-565X.180360

• Computer Science & Technology •

Method of Network Compression and Hardware Acceleration Based on Tiny-yolo

HUANG Zhiyong, WU Haihua, YU Zhi, ZHONG Yuanhong

  1. School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
  • Received: 2018-07-08  Revised: 2018-11-13  Online: 2019-06-25  Published: 2019-05-05
  • Corresponding author: HUANG Zhiyong (b. 1978), male, Ph.D., associate professor; his research focuses on wireless sensor network modeling and high-efficiency embedded computing. E-mail: zyhuang@cqu.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (61501069)

Abstract: The Tiny-yolo network model is large, memory-hungry and computationally expensive, which makes it difficult to implement on embedded devices. To address these problems, an optimization method combining network compression with hardware acceleration is proposed. First, the network connection relationships are analyzed and connections that contribute little to the network are pruned, compressing the network; the pruned weight matrix is then kept in a sparse storage format to reduce memory consumption. Second, the weights are quantized: by changing the bit width of the data, memory footprint and computational complexity are further reduced while the accuracy loss is kept within a guaranteed error bound. Finally, according to the structural characteristics of the Tiny-yolo network, a deep parallel-pipeline FPGA acceleration scheme is proposed, realizing hardware acceleration of the Tiny-yolo computation. Experiments show that pruning combined with quantization achieves a compression ratio of about 36x for the network model, and that the hardware-accelerated design computes about 7x faster than the same workload on an ARM Cortex-A9 running at a maximum frequency of 667 MHz.

Key words: neural network, Tiny-yolo, compression, hardware acceleration, FPGA
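The abstract's compression pipeline (prune low-contribution connections → store the pruned matrix sparsely → quantize the surviving weights) can be illustrated with a small NumPy sketch. None of the paper's concrete details appear in this abstract, so every shape, threshold and bit width below is an assumption for illustration only, not the authors' actual scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense float32 weight matrix standing in for one layer.
W = rng.standard_normal((256, 256)).astype(np.float32)

# Step 1: prune connections whose magnitude (a common proxy for
# contribution) falls below a threshold; here we keep the top 10%.
threshold = np.quantile(np.abs(W), 0.90)
mask = np.abs(W) >= threshold
W_pruned = W * mask

# Step 2: sparse (CSR-style) storage of the pruned matrix:
# surviving values plus column indices and row pointers.
values = W_pruned[mask]                                   # kept weights
col_idx = np.nonzero(mask)[1].astype(np.uint16)           # column of each value
row_ptr = np.concatenate(([0], np.cumsum(mask.sum(axis=1)))).astype(np.uint32)

# Step 3: quantize the surviving weights to 8-bit integers
# (symmetric linear quantization).
scale = np.abs(values).max() / 127.0
q_values = np.clip(np.round(values / scale), -127, 127).astype(np.int8)

dense_bytes = W.nbytes                                    # original float32 layer
sparse_bytes = q_values.nbytes + col_idx.nbytes + row_ptr.nbytes
print(f"compression ratio: {dense_bytes / sparse_bytes:.1f}x")
```

Note that the index arrays count against the compressed size, which is why 10% density at 8 bits yields only a low-double-digit ratio in this toy setting; reaching something like the paper's reported 36x would require more aggressive pruning, narrower bit widths, or more compact index encoding than this sketch uses.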
