Document Type
Presentation
Rights
Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence
Disciplines
Computer Sciences
Abstract
Positional encoding is used in both natural language and computer vision transformers. It provides information on the sequence order and relative position of input tokens (such as words in a sentence) for higher performance. Unlike pure language and vision transformers, vision-language transformers do not currently exploit positional encoding schemes to enrich input information. We show that capturing the location information of visual features can help vision-language transformers improve their performance. We take Oscar, one of the state-of-the-art (SOTA) vision-language transformers, as an example transformer for implanting positional encoding. We use image captioning as the downstream task to test performance. We add two types of positional encoding to Oscar: DETR, an absolute positional encoding approach, and iRPE, a relative positional encoding approach. With the same training protocol and data, both positional encodings improve the image captioning performance of Oscar by between 6.8% and 24.1% across the five image captioning evaluation criteria used.
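
As a rough illustration only (not code from the paper), the sketch below shows how a DETR-style 2D sine/cosine positional encoding could be attached to region features before they enter a vision-language transformer such as Oscar. The function name, feature dimensions, and the linear projection used to align dimensions are assumptions made for the example.

    import math
    import torch

    def sine_position_encoding(xy, num_feats=128, temperature=10000.0):
        """DETR-style 2D sine/cosine encoding for normalised (x, y) positions.

        xy: tensor of shape (num_regions, 2) with coordinates in [0, 1].
        Returns a (num_regions, 2 * num_feats) positional embedding.
        """
        scale = 2 * math.pi
        dim_t = torch.arange(num_feats, dtype=torch.float32)
        dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)

        pos = xy * scale                       # (N, 2)
        pos = pos[:, :, None] / dim_t          # (N, 2, num_feats)
        # Interleave sine and cosine over the feature dimension, as in DETR.
        pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=3).flatten(2)
        return pos.flatten(1)                  # (N, 2 * num_feats)

    # Example: make Faster R-CNN region features position-aware
    # (the 2048-d feature size and 36 regions are illustrative values).
    regions = torch.randn(36, 2048)            # region features from the detector
    centres = torch.rand(36, 2)                # normalised bounding-box centres (x, y)
    pos_emb = sine_position_encoding(centres)  # (36, 256)
    proj = torch.nn.Linear(pos_emb.shape[-1], regions.shape[-1])
    regions_with_pos = regions + proj(pos_emb) # position-aware visual tokens

A relative scheme such as iRPE instead injects position-dependent terms into the attention scores rather than into the input features; the sketch above covers only the absolute case.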
DOI
https://doi.org/10.21427/DQ99-6T76
Recommended Citation
Liu, X., Delany, S. J., & McKeever, S. (2023). Applying Positional Encoding to Enhance Vision-Language Transformers. Technological University Dublin. DOI: 10.21427/DQ99-6T76
Funder
Science Foundation Ireland
Publication Details
This paper was accepted at VISAPP 2023 in Lisbon, Portugal.