Journal of South China University of Technology (Natural Science Edition) ›› 2018, Vol. 46 ›› Issue (1): 103-111. doi: 10.3969/j.issn.1000-565X.2018.01.014

• Computer Science & Technology •

Research on Video Description Based on Adaptive Frame Sampling Algorithm and Bidirectional Long Short-Term Memory

ZHANG Rongfeng, NING Peiyang, XIAO Huanhou, SHI Jinglun, QIU Wei

  1. School of Electronic and Information Engineering, South China University of Technology
  • Received: 2017-05-16 Revised: 2017-06-18 Online: 2018-01-25 Published: 2017-12-01
  • Corresponding author: ZHANG Rongfeng (born 1980), male, Ph.D. candidate, mainly engaged in research on machine learning and video processing. E-mail: rongfzhang@qq.com
  • About the author: ZHANG Rongfeng (born 1980), male, Ph.D. candidate, mainly engaged in research on machine learning and video processing
  • Supported by:
    the National Natural Science Foundation of China (61671213);
    the Guangzhou Key Laboratory of Human Body Data Science (201605030011)

Research on Video Description Based on Adaptive Frame Sampling Algorithm and Bidirectional Long Short-Term Memory

ZHANG Rongfeng, NING Peiyang, XIAO Huanhou, SHI Jinglun, QIU Wei

  1. School of Electronic and Information Engineering, South China University of Technology
  • Received:2017-05-16 Revised:2017-06-18 Online:2018-01-25 Published:2017-12-01
  • Contact: ZHANG Rongfeng (born 1980), male, Ph.D. candidate, mainly engaged in research on machine learning and video processing. E-mail: rongfzhang@qq.com
  • About author: ZHANG Rongfeng (born 1980), male, Ph.D. candidate, mainly engaged in research on machine learning and video processing
  • Supported by:
    The National Natural Science Foundation of China (61671213);
    the Guangzhou Key Laboratory of Human Body Data Science (201605030011)

Abstract: Video to text is a new and challenging task in the field of computer vision. To address this problem, a video-to-text method based on an adaptive frame sampling algorithm and a bidirectional long short-term memory (BLSTM) model is proposed. The adaptive frame sampling algorithm dynamically adjusts the sampling rate so as to provide as many features as possible for model training. Combined with the BLSTM model, the correlated information of both past and future frames in a video can be learned effectively. Moreover, the features used for training are extracted by a deep convolutional neural network, so that this doubly deep network structure can learn the spatio-temporal correlation representation and global dependency information of video frames. The fusion of frame information further increases the variety of features and thus improves the experimental results. The results show that, on the M-VAD and MPII-MD datasets, the proposed method achieves average METEOR scores of 7.8 and 9.1 respectively, which are 15.7% and 28.2% higher than those of the original S2VT model, and the linguistic quality of the generated descriptions is also improved.
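The abstract states only that the sampling rate is adjusted dynamically; the paper's exact criterion is not given here. As an illustrative sketch, the following samples densely where consecutive frames differ strongly (fast motion) and sparsely elsewhere; the mean-absolute-difference metric, step sizes, and threshold are assumptions for illustration, not the authors' algorithm.

```python
def adaptive_sample(frames, base_step=10, min_step=2, threshold=0.1):
    """Sketch of adaptive frame sampling: sample densely when
    consecutive frames differ a lot, sparsely otherwise.

    `frames` is a list of flat pixel lists; the difference metric
    and thresholds are illustrative assumptions only.
    """
    def diff(a, b):
        # mean absolute pixel difference between two flat frames
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    sampled = [0]  # always keep the first frame (by index)
    i = 0
    while i < len(frames) - 1:
        nxt = min(i + 1, len(frames) - 1)
        # large inter-frame change -> small step (dense sampling)
        step = min_step if diff(frames[i], frames[nxt]) > threshold else base_step
        i += step
        if i < len(frames):
            sampled.append(i)
    return sampled
```

On a clip whose first half is static and second half changes every frame, this returns indices spaced `base_step` apart early on and `min_step` apart once motion begins, so more features from the dynamic segment reach the model.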

Key words: video to text, adaptive frame sampling, bidirectional long short-term memory, deep convolutional neural networks, fusion of frame information

Abstract: Video to text is a new and challenging task in the field of computer vision. To address this technical difficulty, this paper proposes an adaptive frame sampling algorithm and employs a bidirectional long short-term memory (BLSTM) model and a deep BLSTM, based on video features extracted by deep convolutional neural networks. Since this doubly deep network structure can learn the spatial and temporal correlation description of videos, it is able to obtain global dependency information from the space and time domains. Experimental results show that, on the M-VAD and MPII-MD datasets, the proposed framework achieves average METEOR scores of 7.8 and 9.1, respectively. Compared with the original S2VT model, the proposed method improves the average score by 15.7% and 28.2% and also improves the quality of the generated video descriptions.
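The bidirectional recurrence that lets every time step see both past and future frames can be sketched as follows. A toy elementwise cell stands in for a full LSTM cell (gates omitted), and the weights are arbitrary illustrative values; only the forward/backward/concatenate pattern reflects the BLSTM structure described above.

```python
import math

def toy_cell(x, h, w_x=0.5, w_h=0.8):
    """Toy recurrent cell standing in for an LSTM cell: the new state
    is a bounded mix of the input and the previous state (illustrative
    only; a real LSTM adds input/forget/output gates and a cell state)."""
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h)]

def bidirectional_encode(sequence, dim):
    """Run the cell over the sequence forward and backward, then
    concatenate both hidden states at each step, so every position
    carries context from both earlier and later frames."""
    zero = [0.0] * dim
    fwd, h = [], zero
    for x in sequence:                 # left-to-right pass
        h = toy_cell(x, h)
        fwd.append(h)
    bwd, h = [], zero
    for x in reversed(sequence):       # right-to-left pass
        h = toy_cell(x, h)
        bwd.append(h)
    bwd.reverse()                      # realign with forward order
    return [f + b for f, b in zip(fwd, bwd)]   # 2*dim features per step
```

Stacking such a bidirectional encoder on top of CNN frame features gives the "doubly deep" arrangement the abstract refers to: the CNN supplies the per-frame spatial representation, and the bidirectional recurrence supplies the temporal one.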

Key words: video to text, adaptive frame sampling, bidirectional LSTM, deep convolutional neural networks, fusion of frame information

CLC number: