JOIG 2023 Vol.11(3):294-301
doi: 10.18178/joig.11.3.294-301

Evaluating Performances of Attention-Based Merge Architecture Models for Image Captioning in Indian Languages

Rahul Tangsali, Swapnil Chhatre *, Soham Naik, Pranav Bhagwat, and Geetanjali Kale
Department of Computer Engineering, SCTR’s Pune Institute of Computer Technology, Pune, India;
Email: rahuul2001@gmail.com (R.T.), nsoham01@gmail.com (S.N.), gvkale@pict.edu (G.K.), pranav221b@gmail.com (P.B.)
*Correspondence: swapchhatre5@gmail.com (S.C.)

Manuscript received February 6, 2023; revised March 23, 2023; accepted April 15, 2023.

Abstract—Image captioning is an active research area that has seen numerous advances in the past few years. Deep learning methods have been used extensively to generate textual descriptions of images, and attention-based captioning mechanisms have been proposed that give state-of-the-art results. However, few applications and analyses of these methods exist for languages of the Indian subcontinent. This paper presents attention-based merge architecture models for generating accurate image captions in four Indian languages: Marathi, Kannada, Malayalam, and Tamil. The widely known Flickr8K dataset was used for this project. Pre-trained Convolutional Neural Network (CNN) models and language decoder attention models were implemented as the components of the proposed merge architecture. The generated captions were then compared against the gold captions using Bilingual Evaluation Understudy (BLEU) as the evaluation metric. The merge architectures that use InceptionV3 gave the best results for the languages tested; the detailed scores are discussed in the paper. The highest BLEU-1 scores obtained for each language were 0.4939 for Marathi, 0.4557 for Kannada, 0.5082 for Malayalam, and 0.5201 for Tamil. The proposed architectures gave much higher scores than other architectures implemented for these languages.
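As a rough illustration of the BLEU-1 metric named in the abstract, the following Python sketch computes sentence-level BLEU-1 as clipped unigram precision multiplied by the brevity penalty. The abstract does not state which BLEU implementation, tokenization, or aggregation (sentence-level versus corpus-level) was used, so the whitespace tokenization, the function name `bleu1`, and the example captions here are assumptions made for illustration only.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1 (illustrative sketch, not the paper's code).

    candidate:  generated caption as a string
    references: list of gold caption strings
    Assumes whitespace tokenization, which may not suit every language.
    """
    cand = candidate.split()
    refs = [ref.split() for ref in references]
    if not cand or not refs:
        return 0.0

    # For each token, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in refs:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)

    # Clip each candidate token count by its reference maximum.
    clipped = sum(
        min(count, max_ref_counts[token])
        for token, count in Counter(cand).items()
    )
    precision = clipped / len(cand)

    # Reference length closest to the candidate length (ties: shorter).
    ref_len = min(
        (len(ref) for ref in refs),
        key=lambda n: (abs(n - len(cand)), n),
    )

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(cand))
    return penalty * precision


if __name__ == "__main__":
    # Hypothetical captions, not taken from the paper or from Flickr8K.
    gold = ["a dog runs on the grass", "a brown dog is running outside"]
    print(round(bleu1("a dog runs outside", gold), 4))
```

A score of 1.0 means every generated token is covered by a gold caption and the generated caption is not shorter than the closest reference; repeated tokens are clipped, so a caption cannot gain credit by repeating a common word.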

Keywords—image captioning, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), pre-trained Convolutional Neural Network (CNN) models, Indian languages

Cite: Rahul Tangsali, Swapnil Chhatre, Soham Naik, Pranav Bhagwat, and Geetanjali Kale, "Evaluating Performances of Attention-Based Merge Architecture Models for Image Captioning in Indian Languages," Journal of Image and Graphics, Vol. 11, No. 3, pp. 294-301, September 2023.

Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.