This item is available under a Creative Commons License for non-commercial use only
Image Captioning is a task that requires models to acquire a multi-modal understanding of the world and to express this understanding in natural language text. While the state-of-the-art for this task has rapidly improved in terms of n-gram metrics, these models tend to output the same generic captions for similar images. In this work, we address this limitation and train a model that generates more diverse and specific captions through an unsupervised training approach that incorporates a learning signal from an Image Retrieval model. We summarize previous results and improve the state-of-the-art on caption diversity and novelty.
We make our source code publicly available online: https://github.com/AnnikaLindh/Diverse_and_Specific_Image_Captioning
Lindh, A., Ross, R. J., Mahalunkar, A., Salton, G., & Kelleher, J. D. (2018). Generating Diverse and Meaningful Captions: Unsupervised Specificity Optimization for Image Captioning. In Artificial Neural Networks and Machine Learning – ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part I.: Springer International Publishing.Proceedings, Part I.: Springer International Publishing.