Journal of South China University of Technology (Natural Science Edition) ›› 2018, Vol. 46 ›› Issue (8): 88-95.doi: 10.3969/j.issn.1000-565X.2018.08.013

• Computer Science & Technology •

Video Captioning Based on C3D and Visual Elements

XIAO Huanhou, SHI Jinglun

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2017-08-30; Revised: 2018-03-21; Online: 2018-08-25; Published: 2018-07-01
  • Contact: XIAO Huanhou (born 1994), male, Ph.D. candidate, whose research focuses on deep learning and video processing. E-mail: 994082361@qq.com
  • About author: XIAO Huanhou (born 1994), male, Ph.D. candidate, whose research focuses on deep learning and video processing
  • Supported by:
      the National Natural Science Foundation of China(61671213) 

Abstract: With the development of deep learning, approaches that extract video features with convolutional neural networks (CNNs) and generate sentences with recurrent neural networks (RNNs) have been widely used in video captioning tasks. However, such direct translation ignores much of the intrinsic information in a video, such as temporal information, motion information, and abundant visual-element information. This paper proposes an AFCF-MVC model that takes C3D features, which contain rich spatio-temporal information, as the network input. An adaptive feature extraction algorithm exploits the information of the whole video, and an adaptive frame cycle filling algorithm provides as many features as possible to the network, playing the role of repeated learning. In addition, to make use of the rich visual elements in a video, the visual elements of video frames are detected by a visual detector and encoded into the network as additional supplementary information. Experimental results show that the proposed method achieves the best performance on the M-VAD and MPII-MD datasets.
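The abstract does not give implementation details of the adaptive frame cycle filling algorithm. As an illustrative sketch only, assuming per-clip C3D feature vectors and a fixed number of network time steps (both the function name `cycle_fill` and the sampling strategy here are the editor's assumptions, not the authors' method), the cyclic-filling idea of feeding short videos repeatedly to the network might look like:

```python
import numpy as np

def cycle_fill(features, target_len):
    """Pad or subsample a variable-length sequence of per-clip features
    to a fixed length. Short videos are cycled so the network sees the
    same clips several times (repeated learning); long videos are
    uniformly subsampled.

    features: array of shape (n_clips, feat_dim), e.g. 4096-d C3D fc features.
    """
    n = len(features)
    if n >= target_len:
        # Uniformly sample clip indices when the video is long enough.
        idx = np.linspace(0, n - 1, target_len).astype(int)
    else:
        # Cycle through the clips until the target length is reached.
        idx = np.arange(target_len) % n
    return features[idx]

# Example: 5 clip features padded to 12 time steps by cycling.
feats = np.arange(5 * 3, dtype=float).reshape(5, 3)
padded = cycle_fill(feats, 12)
print(padded.shape)  # (12, 3)
```

Clip 0 reappears at time steps 0, 5, and 10, so a two-second video still fills every LSTM time step instead of being zero-padded.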

Key words: deep learning, convolutional neural networks, recurrent neural networks, video captioning, self-adaptive, visual elements
