Document Type
Conference Paper
Rights
This item is available under a Creative Commons License for non-commercial use only
Disciplines
Computer Sciences
Abstract
A persistent limitation of current Image Captioning models is their tendency to produce generic captions that omit the distinctive details that make each image unique. To address this limitation, we propose an approach that enforces a stronger alignment between image regions and specific segments of text. The model architecture is composed of a visual region proposer, a region-order planner and a region-guided caption generator. The region-guided caption generator incorporates a novel information gate which allows visual and textual inputs of differing frequencies and dimensionalities to be combined in a Recurrent Neural Network.
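One way to picture the information gate described above is as an element-wise sigmoid gate that blends a visual region feature (which changes only at region boundaries, i.e. at a lower frequency) with per-token word embeddings, after projecting both into a shared hidden space. The following is a minimal sketch under those assumptions; the dimensionalities, projection matrices and gating form are illustrative and are not taken from the paper.

```python
import numpy as np

# Hypothetical sketch of an information gate that merges a low-frequency
# visual region feature with high-frequency per-token word embeddings
# before feeding an RNN step. All names and sizes are assumptions.

rng = np.random.default_rng(0)

D_VIS, D_TXT, D_HID = 2048, 300, 512   # assumed input/hidden sizes

# Learned projections (random here, for illustration only)
W_vis = rng.normal(0, 0.01, (D_HID, D_VIS))      # projects region feature
W_txt = rng.normal(0, 0.01, (D_HID, D_TXT))      # projects word embedding
W_gate = rng.normal(0, 0.01, (D_HID, 2 * D_HID)) # computes the gate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_input(region_feat, word_emb):
    """Blend visual and textual features via an element-wise gate in (0, 1)."""
    v = W_vis @ region_feat                        # (D_HID,)
    t = W_txt @ word_emb                           # (D_HID,)
    g = sigmoid(W_gate @ np.concatenate([v, t]))   # per-dimension mixing weight
    return g * v + (1.0 - g) * t                   # gated mix fed to the RNN

region = rng.normal(size=D_VIS)   # one region guides several consecutive tokens
for _ in range(3):
    x = gated_input(region, rng.normal(size=D_TXT))
    assert x.shape == (D_HID,)
```

Because the gate is element-wise, each hidden dimension can independently lean toward the visual or the textual signal, which is one plausible way to reconcile inputs that arrive at different rates and sizes.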
Recommended Citation
Lindh, A., Ross, R. J., & Kelleher, J. D. (2018). Entity-Grounded Image Captioning. ECCV 2018 Workshop on Shortcomings in Vision and Language (SiVL), Munich, Germany, September 8, 2018. doi:10.21427/D7ZN6Q
DOI
https://doi.org/10.21427/D7ZN6Q
Publication Details
ECCV 2018 Workshop on Shortcomings in Vision and Language (SiVL), Munich, Germany, September 8, 2018.