1.2 COMPUTER AND INFORMATION SCIENCE
Image Captioning (IC) is the task of generating natural language descriptions for images. Models typically encode the image with a convolutional neural network (CNN) and generate the caption with a recurrent model or a multi-modal transformer. Success is measured by the similarity between generated captions and human-written “ground-truth” captions, using metrics such as CIDEr, SPICE, and METEOR. While incremental gains have been made on these metrics, little attention has been paid to end-user opinions on the amount of content in captions. Studies with blind and low-vision participants have found that a lack of detail is a problem [6, 13, 17], that the preferred amount of content varies between individuals, and that individuals differ in how they weigh correctness against additional content generated with lower confidence. We propose a more user-centered approach with an adjustable amount of content based on the number of regions to describe.
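The idea of controlling caption content via a region count can be sketched as follows. This is an illustrative toy example, not the paper's model: the function name, the template-based caption generation, and the (label, confidence) region format are all assumptions made for this sketch.

```python
# Illustrative sketch (not the paper's implementation): given detected
# image regions ranked by detector confidence, a user-chosen k controls
# how many regions the caption mentions -- a larger k yields a caption
# with more content, at the risk of including lower-confidence regions.

def caption_with_k_regions(regions, k):
    """regions: list of (label, confidence) pairs; k: number of regions
    the user wants described. Returns a simple template-based caption."""
    # Keep only the k most confident regions.
    chosen = sorted(regions, key=lambda r: r[1], reverse=True)[:k]
    labels = [label for label, _ in chosen]
    if not labels:
        return "An image."
    if len(labels) == 1:
        return f"An image of {labels[0]}."
    return "An image of " + ", ".join(labels[:-1]) + f" and {labels[-1]}."

# Hypothetical detector output for one image.
detections = [("a dog", 0.95), ("a frisbee", 0.80),
              ("grass", 0.60), ("a fence", 0.40)]
print(caption_with_k_regions(detections, 1))  # low-content preference
print(caption_with_k_regions(detections, 3))  # higher-content preference
```

A learned captioner would condition its decoder on the selected regions rather than fill a template, but the user-facing control knob is the same: the number of regions to describe.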
Lindh, Annika; Ross, Robert J.; and Kelleher, John, “Show, Prefer and Tell: Incorporating User Preferences into Image Captioning” (2023). Conference Papers. 409.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.