Conference papers

Applying Positional Encoding to Enhance Vision-Language Transformers

Xuehao Liu, Technological University Dublin
Sarah Jane Delany, Technological University Dublin
Susan McKeever, Technological University Dublin

Document Type

Presentation

Rights

Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence

Disciplines

Computer Sciences

Publication Details

This paper was accepted at VISAPP 2023 in Lisbon, Portugal.

Published online:

https://www.scitepress.org/ProceedingsDetails.aspx?ID=H6DnywaGXmg=&t=1

Conference website:

https://visapp.scitevents.org/?y=2023

Abstract

Positional encoding is used in both natural language and computer vision transformers. It provides information on the sequence order and relative position of input tokens (such as words in a sentence), which improves performance. Unlike pure language and vision transformers, vision-language transformers do not currently exploit positional encoding schemes to enrich input information. We show that capturing the location information of visual features can help vision-language transformers improve their performance. We take Oscar, one of the state-of-the-art (SOTA) vision-language transformers, as an example transformer for incorporating positional encoding, and use image captioning as a downstream task to test performance. We add two types of positional encoding to Oscar: DETR as an absolute positional encoding approach and iRPE as a relative positional encoding approach. With the same training protocol and data, both positional encodings improve the image captioning performance of Oscar by between 6.8% and 24.1% across the five image captioning evaluation criteria used.
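As an illustration of what an absolute positional encoding of visual features might look like, the following is a minimal sketch of a sine-based encoding of a region's centre, loosely in the spirit of the sinusoidal scheme associated with DETR. It is not the implementation used in the paper: the function names, the choice of encoding the bounding-box centre, the feature size, and the element-wise addition to the region feature are all assumptions made for illustration.

```python
import math


def sine_encode_1d(coord, num_feats=4, temperature=10000.0):
    """Encode one normalised coordinate in [0, 1] as interleaved sin/cos values.

    Hypothetical helper: num_feats and temperature are illustrative defaults.
    """
    angle = coord * 2.0 * math.pi  # scale the coordinate to [0, 2*pi]
    feats = []
    for i in range(num_feats):
        # Each sin/cos pair shares one frequency; later pairs vary more slowly.
        freq = temperature ** (2 * (i // 2) / num_feats)
        value = angle / freq
        feats.append(math.sin(value) if i % 2 == 0 else math.cos(value))
    return feats


def encode_box_centre(box, image_w, image_h, num_feats=4):
    """Encode the centre of a box (x0, y0, x1, y1) as y-features then x-features.

    Assumption: the region's location is summarised by its normalised centre.
    """
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2.0 / image_w
    cy = (y0 + y1) / 2.0 / image_h
    return sine_encode_1d(cy, num_feats) + sine_encode_1d(cx, num_feats)


def add_position(feature, encoding):
    """Add a positional encoding to a region feature of the same length.

    Assumption: position is injected by element-wise addition.
    """
    if len(feature) != len(encoding):
        raise ValueError("feature and encoding must have the same length")
    return [f + e for f, e in zip(feature, encoding)]


if __name__ == "__main__":
    # A region in a 640x480 image; the 8-value feature is a stand-in.
    enc = encode_box_centre((100, 50, 300, 250), image_w=640, image_h=480)
    print(add_position([0.0] * 8, enc))
```

A relative scheme such as iRPE would instead encode the offset between pairs of regions inside the attention computation, which this sketch does not attempt to show.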

DOI

https://doi.org/10.5220/0011796100003417

Recommended Citation

Liu, X.; Delany, S. and McKeever, S. (2023). Applying Positional Encoding to Enhance Vision-Language Transformers. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, ISBN 978-989-758-634-7; ISSN 2184-4321, SciTePress, pages 838-845. DOI: 10.5220/0011796100003417

Funder

Science Foundation Ireland

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Included in

Computer Sciences Commons
