Document Type

Conference Paper

Disciplines

1.2 COMPUTER AND INFORMATION SCIENCE

Publication Details

https://dl.acm.org/doi/pdf/10.1145/3555776.3577794

SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, March 2023, Pages 1139–1142

https://doi.org/10.1145/3555776.3577794

Abstract

Image Captioning (IC) is the task of generating natural language descriptions for images. Models encode the image using a convolutional neural network (CNN) and generate the caption via a recurrent model or a multi-modal transformer. Success is measured by the similarity between generated captions and human-written “ground-truth” captions, using the CIDEr [14], SPICE [1] and METEOR [2] metrics. While incremental gains have been made on these metrics, there is a lack of focus on end-user opinions on the amount of content in captions. Studies with blind and low-vision participants have found that lack of detail is a problem [6, 13, 17], and that the preferred amount of content varies between individuals [13], as do individual opinions on the trade-off between correctness and adding additional content with lower confidence [9]. We propose a more user-centered approach with an adjustable amount of content based on the number of regions to describe.
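The abstract measures caption quality by n-gram similarity to reference captions. As a rough illustration of the idea behind such metrics, the sketch below computes a simplified n-gram cosine similarity between a candidate caption and a set of references; it omits CIDEr's TF-IDF weighting and stemming, so it is an assumption-laden toy, not the actual CIDEr, SPICE, or METEOR implementation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c1[k] * c2[k] for k in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def ngram_similarity(candidate, references, max_n=2):
    """Average n-gram cosine similarity (n = 1..max_n) over all references.
    Simplified stand-in for CIDEr-style scoring: no TF-IDF weighting."""
    cand = candidate.lower().split()
    scores = []
    for ref in references:
        ref_toks = ref.lower().split()
        per_n = [cosine(ngrams(cand, n), ngrams(ref_toks, n))
                 for n in range(1, max_n + 1)]
        scores.append(sum(per_n) / max_n)
    return sum(scores) / len(scores)
```

An exact match scores 1.0 and unrelated captions score near 0; real CIDEr additionally down-weights n-grams that are common across the whole reference corpus.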

Creative Commons License

Creative Commons Attribution-Share Alike 4.0 International License
This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

