Research on Video Description Based on Adaptive Frame Sampling Algorithm and Bidirectional Long Short-Term Memory

Zhang Rongfeng, Ning Peiyang, Xiao Huanhou, Shi Jinglun, Qiu Wei

doi:10.3969/j.issn.1000-565X.2018.01.014

Journal of South China University of Technology (Natural Science Edition)

2018, Vol. 46, Issue 1: 103-111

Computer Science and Technology

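School of Electronic and Information Engineering, South China University of Technology

Zhang Rongfeng (b. 1980), male, Ph.D. candidate; research interests: machine learning and video processing

Received: 2017-05-16

Revised: 2017-06-18

Published online: 2017-12-01

Funding

National Natural Science Foundation of China (61671213);
Guangzhou Key Laboratory of Human Body Data Science (201605030011)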

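Abstract (Chinese version)

Video to text is a new and challenging task in computer vision. To address it, this paper proposes a video-to-text method based on an adaptive frame sampling algorithm and a bidirectional long short-term memory (BLSTM) model. The adaptive frame sampling algorithm dynamically adjusts the sampling rate so as to provide as many features as possible for training the model. Combined with the BLSTM model, it can effectively learn the relevant information from both the preceding and the future frames of a video. In addition, the training features are extracted by a deep convolutional neural network, so this doubly deep network structure can learn the spatio-temporal correlation representation of video frames as well as global dependency information. Fusing frame information further increases the variety of features and thereby improves the experimental results. On the M-VAD and MPII-MD datasets, the proposed method achieves average METEOR scores of 7.8 and 9.1, respectively, which are 15.7% and 28.2% higher than those of the original S2VT model, and it also improves the linguistic quality of the generated descriptions.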

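Keywords: video to text; adaptive frame sampling; bidirectional long short-term memory; deep convolutional neural network; frame information fusion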
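
The abstract states only that the sampling rate is adjusted dynamically so that as many features as possible reach the model; it does not give the rule. The sketch below is therefore an illustration under stated assumptions, not the authors' algorithm: it assumes the recurrent model accepts at most `max_steps` frames per video (a hypothetical budget) and picks the smallest stride that fits, so short videos keep every frame and long videos are thinned only as much as needed. The names `adaptive_stride` and `sample_frame_indices` are invented for this example.

```python
import math
from typing import List


def adaptive_stride(num_frames: int, max_steps: int) -> int:
    """Return the smallest stride whose sampled sequence fits in max_steps.

    Assumption (not from the paper): the model has a fixed budget of
    max_steps input frames per video.
    """
    if num_frames <= 0 or max_steps <= 0:
        raise ValueError("num_frames and max_steps must be positive")
    return max(1, math.ceil(num_frames / max_steps))


def sample_frame_indices(num_frames: int, max_steps: int) -> List[int]:
    """Indices of the frames kept after sampling with the adaptive stride."""
    stride = adaptive_stride(num_frames, max_steps)
    return list(range(0, num_frames, stride))


if __name__ == "__main__":
    # A short clip keeps every frame; a long clip is subsampled.
    for n in (50, 200):
        kept = sample_frame_indices(n, max_steps=80)
        print(n, "frames ->", len(kept), "kept, stride", adaptive_stride(n, 80))
```

A fixed stride would either discard frames from short clips or overflow the budget on long ones; choosing the stride per video is one simple way to avoid both.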

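Cite this article

Zhang Rongfeng, Ning Peiyang, Xiao Huanhou, Shi Jinglun, Qiu Wei. Research on Video Description Based on Adaptive Frame Sampling Algorithm and Bidirectional Long Short-Term Memory[J]. Journal of South China University of Technology (Natural Science Edition), 2018, 46(1): 103-111. DOI: 10.3969/j.issn.1000-565X.2018.01.014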

Abstract

Video to text is a new and challenging task in the field of computer vision. To address it, this paper proposes an adaptive frame sampling algorithm and employs Bidirectional Long Short-Term Memory (BLSTM) and deep BLSTM models trained on video features extracted by deep Convolutional Neural Networks. Because this doubly deep network structure can learn the spatial and temporal correlations of videos, it is able to capture global dependency information across the space and time domains. Experimental results on the M-VAD and MPII-MD datasets show that the proposed framework achieves average METEOR scores of 7.8 and 9.1, respectively. Compared with the original S2VT model, the proposed method improves the average score by 15.7% and 28.2%, respectively, and it also improves the quality of the generated video descriptions.

Keywords: video to text; adaptive frame sampling; bidirectional LSTM; deep convolutional neural networks; frame information fusion
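
The abstract reports the new METEOR scores and the relative gains but not the S2VT baseline scores themselves. As a rough consistency check, the baselines can be back-computed, assuming the gain is defined as (new - old) / old; that definition is an assumption, since the abstract does not state it. The function name `implied_baseline` is hypothetical.

```python
def implied_baseline(score: float, relative_gain_pct: float) -> float:
    """Baseline score implied by a new score and a relative gain in percent.

    Assumes gain = (new - old) / old, which the abstract does not state.
    """
    return score / (1.0 + relative_gain_pct / 100.0)


# (METEOR score, relative gain in %) as reported in the abstract.
reported = {"M-VAD": (7.8, 15.7), "MPII-MD": (9.1, 28.2)}

baselines = {
    name: implied_baseline(score, gain) for name, (score, gain) in reported.items()
}

if __name__ == "__main__":
    for name, value in baselines.items():
        print(f"{name}: implied S2VT METEOR baseline of about {value:.2f}")
```

Under that assumption the implied baselines are roughly 6.7 on M-VAD and 7.1 on MPII-MD; these are derived figures, not values quoted from the paper.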
