Journal of South China University of Technology (Natural Science Edition) ›› 2018, Vol. 46 ›› Issue (8): 88-95. doi: 10.3969/j.issn.1000-565X.2018.08.013

• Computer Science &amp; Technology •

Video Captioning Based on C3D and Visual Elements

 XIAO Huanhou, SHI Jinglun

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received:2017-08-30 Revised:2018-03-21 Online:2018-08-25 Published:2018-07-01
  • Contact: XIAO Huanhou (b. 1994), male, Ph.D. candidate, whose research focuses on deep learning and video processing. E-mail: 994082361@qq.com
  • About author: XIAO Huanhou (b. 1994), male, Ph.D. candidate, whose research focuses on deep learning and video processing
  • Supported by:
      the National Natural Science Foundation of China (61671213) and the Guangzhou Key Laboratory of Body Data Science Project Fund

Abstract: With the development of deep learning, the approach of extracting video frame features with convolutional neural networks (CNNs) and then generating sentences with recurrent neural networks (RNNs) has been widely applied to video captioning. However, this direct translation ignores much of a video's intrinsic information, such as the temporal ordering of the sequence, motion information, and rich visual-element information. This paper proposes an AFCF-MVC model that takes C3D features, which carry rich spatio-temporal information, as the network input. Its adaptive feature extraction method exploits the information of every frame in the video sequence, and its adaptive frame cycle filling method supplies the network with as many input features as possible, which also acts as a form of repeated learning. In addition, to make use of the rich visual elements in a video, a visual detector detects the visual elements of the video frames, and the detections are encoded and fused into the network as supplementary information. Experimental results show that the proposed method achieves the best performance on the M-VAD and MPII-MD datasets.
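
The abstract names the model's two key mechanisms but, as an abstract, gives no implementation details. Below is a minimal sketch of one plausible reading of the adaptive frame cycle filling idea, assuming per-clip C3D features and a fixed encoder length; the function name, `target_len`, and the uniform-subsampling branch for long videos are illustrative assumptions, not the authors' specification.

```python
import numpy as np

def adaptive_frame_cycle_fill(features: np.ndarray, target_len: int) -> np.ndarray:
    """Pad or subsample an (n_clips, feat_dim) sequence of C3D features to a
    fixed number of encoder time steps.

    Short videos are filled by cycling through their clips, so every time
    step carries a real feature and the clips are presented repeatedly
    (the "repeated learning" effect the abstract mentions).
    """
    n = features.shape[0]
    if n >= target_len:
        # Long video: uniformly subsample so the whole sequence is still covered.
        idx = np.linspace(0, n - 1, target_len).round().astype(int)
    else:
        # Short video: cycle through the clips until target_len is reached.
        idx = np.arange(target_len) % n
    return features[idx]

# 7 C3D clip features (4096-D, as in the standard C3D fc6/fc7 layers),
# padded to a 20-step encoder input.
feats = np.random.randn(7, 4096).astype(np.float32)
print(adaptive_frame_cycle_fill(feats, target_len=20).shape)  # (20, 4096)
```

Likewise, a hedged sketch of how detected visual elements might be encoded and fused with a C3D feature to form the per-step RNN input; the hypothetical label vocabulary, the multi-hot encoding, and plain concatenation are stand-ins for whatever detector output, encoder, and fusion scheme the paper actually uses.

```python
import numpy as np

VOCAB = ["person", "car", "dog", "run", "street"]  # hypothetical detector label set

def encode_visual_elements(detections: list[str]) -> np.ndarray:
    """Multi-hot encoding of the visual elements detected in a clip."""
    vec = np.zeros(len(VOCAB), dtype=np.float32)
    for label in detections:
        if label in VOCAB:
            vec[VOCAB.index(label)] = 1.0
    return vec

def fuse(c3d_feat: np.ndarray, detections: list[str]) -> np.ndarray:
    """Concatenate a clip's C3D feature with its visual-element encoding,
    forming one supplementary-information input step for the captioning RNN."""
    return np.concatenate([c3d_feat, encode_visual_elements(detections)])

c3d = np.random.randn(4096).astype(np.float32)  # one clip's C3D feature
print(fuse(c3d, ["person", "street"]).shape)    # (4101,)
```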

Key words: deep learning, convolutional neural networks, recurrent neural networks, video captioning, self-adaptive, visual elements
