Automatic video captioning using tree hierarchical deep convolutional neural network and ASRNN-bi-directional LSTM

Bibliographic Details
Published in:	Computing Vol. 106; no. 11; pp. 3691 - 3709
Main Authors:	Kavitha, N., Soundar, K. Ruba, Karthick, R., Kohila, J.
Format:	Journal Article
Language:	English
Published:	Vienna: Springer Vienna, 01-11-2024; Springer Nature B.V.
Subjects:	Artificial Intelligence; Artificial neural networks; Coders; Computer Appl. in Administrative Data Processing; Computer Communication Networks; Computer Science; Information Systems Applications (incl. Internet); Neural networks; Recurrent neural networks; Regular Paper; Software Engineering; Video data; Bi-directional LSTM; Automatic video captioning; Attention segmental recurrent neural network; Tree hierarchical deep convolutional neural network

Description
Summary:	Automatic video understanding technology is increasingly needed because of the growth of mass video data, such as surveillance videos and personal videos. Several methods for automatic video captioning have been presented previously. However, the existing methods have problems: they consume considerable time when processing a large number of frames, and they suffer from overfitting. Automating video captioning is therefore a difficult task, and these problems reduce the accuracy of the final caption. To overcome these issues, this paper proposes Automatic Video Captioning using a Tree Hierarchical Deep Convolutional Neural Network and an attention segmental recurrent neural network with bi-directional Long Short-Term Memory (ASRNN-bi-directional LSTM). The captioning part contains two phases: a feature encoder and a decoder. In the feature encoder phase, the tree hierarchical Deep Convolutional Neural Network (Tree CNN) encodes the video into a vector representation and extracts three kinds of features. In the decoder phase, the attention segmental recurrent neural network (ASRNN) decodes the vector into a textual description. ASRNN-based methods suffer from the long-term dependency problem. To address it, the method attends to all words generated by the bi-directional LSTM and the caption generator in order to extract global context information, because the information carried by the hidden state of the caption generator is local and incomplete. Golden Eagle Optimization is used to optimize the ASRNN weight parameters. The proposed method is implemented in Python. It achieves 34.89%, 29.06% and 20.78% higher accuracy and 23.65%, 22.10% and 29.68% lower Mean Squared Error than the existing methods.
ISSN:	0010-485X 1436-5057
DOI:	10.1007/s00607-024-01334-6
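
Note:	The summary describes extracting global context from all generated words because a single decoder hidden state is only local. The sketch below illustrates that general idea with plain dot-product attention. It is not the paper's ASRNN or its bi-directional LSTM; the function names (softmax, dot, global_context) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def global_context(query, hidden_states):
    """Pool all hidden states into one context vector.

    Each hidden state is weighted by its similarity to the query, so the
    result reflects every generated word rather than only the latest one.
    """
    weights = softmax([dot(query, h) for h in hidden_states])
    dim = len(hidden_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical 2-dimensional hidden states for three generated words.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical current decoder state used as the attention query.
query = [1.0, 0.0]

weights, context = global_context(query, states)
print("attention weights:", weights)
print("context vector:", context)
```

The attention weights sum to one, and states more similar to the query contribute more to the context vector. In a trained captioning model the hidden states and the query would come from learned recurrent layers rather than from fixed toy values.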