Document Type
Conference Paper
Disciplines
1.2 COMPUTER AND INFORMATION SCIENCE
Abstract
Image Captioning (IC) is the task of generating natural language descriptions for images. Models encode the image using a convolutional neural network (CNN) and generate the caption via a recurrent model or a multi-modal transformer. Success is measured by the similarity between generated captions and human-written “ground-truth” captions, using the CIDEr [14], SPICE [1] and METEOR [2] metrics. While incremental gains have been made on these metrics, little attention has been paid to end-users' preferences regarding the amount of content in captions. Studies with blind and low-vision participants have found that lack of detail is a problem [6, 13, 17], that the preferred amount of content varies between individuals [13], and that individuals also differ in how they weigh correctness against additional, lower-confidence content [9]. We propose a more user-centered approach with an adjustable amount of content based on the number of regions to describe.
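The adjustable-content idea can be illustrated with a toy sketch: pick the top-k detected image regions and describe each one, so the user's choice of k controls how much content the caption carries. The region structure, scores, and the clause-joining strategy below are illustrative assumptions, not the paper's actual model:

```python
# Hypothetical sketch of user-adjustable caption content.
# Each "region" is a stand-in for a detected image region with a
# confidence score and a pre-generated descriptive phrase.

def select_regions(regions, k):
    """Keep the k highest-confidence regions."""
    return sorted(regions, key=lambda r: r["score"], reverse=True)[:k]

def caption(regions, k):
    """Join one clause per selected region into a single caption."""
    chosen = select_regions(regions, k)
    return ", ".join(r["phrase"] for r in chosen)

regions = [
    {"phrase": "a dog on a couch", "score": 0.9},
    {"phrase": "a red pillow", "score": 0.6},
    {"phrase": "a window in the background", "score": 0.4},
]

print(caption(regions, 1))  # a dog on a couch
print(caption(regions, 3))  # a dog on a couch, a red pillow, a window in the background
```

A larger k trades higher coverage for a greater risk of including lower-confidence (possibly incorrect) content, which is exactly the trade-off on which user opinions differ [9].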
DOI
https://doi.org/10.1145/3555776.3577794
Recommended Citation
Lindh, Annika; Ross, Robert J.; and Kelleher, John, "Show, Prefer and Tell: Incorporating User Preferences into Image Captioning" (2023). Conference papers. 409.
https://arrow.tudublin.ie/scschcomcon/409
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
Publication Details
https://dl.acm.org/doi/pdf/10.1145/3555776.3577794
SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, March 2023, Pages 1139–1142