Document Type

Dissertation

Rights

This item is available under a Creative Commons License for non-commercial use only

Disciplines

Computer Sciences

Publication Details

A dissertation submitted in partial fulfilment of the requirements of Technological University Dublin for the degree of M.Sc. in Computer Science (Data Science) 2021

Abstract

Generating descriptions for images has seen a recent surge of interest, and with the latest developments in Artificial Intelligence it is one of the prominent applications bridging the gap between Computer Vision and Natural Language Processing. Deep learning has become the backbone driving many new applications, and Image Captioning is one such application where deep learning methods have improved captioning accuracy. The introduction of the Encoder-Decoder framework was a breakthrough in Image Captioning, but caption quality degraded as sequences grew longer. To overcome this, extending the Encoder-Decoder framework with an attention mechanism has become increasingly common. An attention mechanism generates a context vector that carries information computed from the image pixels; using this information, the decoder focuses on a particular region of the image while generating the caption. Researchers have proposed various attention mechanisms for computing this context vector. Luong et al. (2015) proposed a Global attention mechanism, which makes the decoder look at the computed image information at every time step while generating the caption. Similarly, Lu et al. (2017) proposed Adaptive attention, which allows the decoder to decide at each time step whether to attend to the computed image information or to rely on the language model. This research presents a comparative study of these two attention mechanisms for generating image captions on the Flickr30k dataset. A deep Residual Network with 152 layers (ResNet-152) is used as the encoder and an LSTM is used as the decoder. The models are evaluated using the BLEU, METEOR, ROUGE, and CIDEr metrics, and the results show that Adaptive attention yields better metric scores than Global attention.
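
The difference between the two attention mechanisms can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the `sentinel` argument, and the plain dot-product scoring are assumptions made for illustration (the cited papers use learned scoring functions, and a real model would operate on ResNet-152 feature maps and LSTM states rather than toy vectors).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def global_attention(hidden, regions):
    """Global attention: every image region is scored at every time step.

    Dot-product scoring is an assumption chosen for brevity.
    Returns the context vector and the attention weights.
    """
    weights = softmax([dot(hidden, r) for r in regions])
    return weighted_sum(weights, regions), weights


def adaptive_attention(hidden, regions, sentinel):
    """Adaptive attention: a sentinel vector competes with the image regions.

    beta is the share of attention given to the sentinel (language model
    information); 1 - beta is the share given to the image context.
    """
    visual_context, _ = global_attention(hidden, regions)
    extended = softmax([dot(hidden, v) for v in regions + [sentinel]])
    beta = extended[-1]
    context = [beta * s + (1.0 - beta) * c
               for s, c in zip(sentinel, visual_context)]
    return context, beta


if __name__ == "__main__":
    hidden = [1.0, 0.0]                   # toy decoder state
    regions = [[1.0, 0.0], [0.0, 1.0]]    # toy image region features
    sentinel = [0.0, 0.0]                 # toy sentinel vector
    print(global_attention(hidden, regions))
    print(adaptive_attention(hidden, regions, sentinel))
```

In this sketch, a sentinel that aligns strongly with the decoder state pushes `beta` towards 1, so the decoder leans on the language model; a weakly aligned sentinel leaves most of the weight on the image regions.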

DOI

https://doi.org/10.21427/wv3a-md49

