With the development of deep learning, methods that extract video-frame features with convolutional neural networks (CNNs) and then generate sentences with recurrent neural networks (RNNs) have been widely applied to video captioning. However, this direct conversion ignores much of a video's intrinsic information, such as the temporal information of the video sequence, motion information, and rich visual-element information. This paper proposes an AFCF-MVC model that takes C3D features, which contain rich spatio-temporal information, as the network's input. Meanwhile, the adaptive feature extraction method exploits the information of all frames in the video sequence, and the adaptive frame cycle filling method supplies the network with as many input features as possible while also serving as a form of repeated learning. In addition, to make use of the video's rich visual-element information, this paper detects the visual elements of video frames with a visual detector, encodes them, and fuses them into the network as additional supplementary information. Experimental results show that the proposed method achieves the best performance on the M-VAD and MPII-MD datasets.
With the development of deep learning, approaches that extract video features using convolutional neural networks (CNNs) and generate sentences using recurrent neural networks (RNNs) have been widely used in video captioning. However, this direct translation ignores much of a video's intrinsic information, such as temporal information, motion information, and abundant visual-element information. This paper proposes an AFCF-MVC model that uses C3D features, which contain rich spatio-temporal information, as the input to the network. At the same time, the adaptive feature extraction algorithm can exploit the information of all frames in the video, and the adaptive frame cycle filling algorithm provides as many features as possible to the network, which also plays the role of repeated learning. In addition, in order to make use of the rich visual elements of the video, this paper detects the visual elements of video frames with a visual detector and encodes them into the network as additional supplementary information. Experimental results show that the proposed method achieves the best performance on the M-VAD and MPII-MD datasets.