Articles

Shapley Idioms: Analysing BERT Sentence Embeddings for General Idiom Token Identification

Vasudevan Nedumpozhimana, Technological University Dublin
Filip Klubička, Technological University Dublin
John Kelleher, Technological University Dublin

Document Type

Article

Rights

Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence

Disciplines

Computer Sciences

Publication Details

Open access

https://www.frontiersin.org/articles/10.3389/frai.2022.813967/full

Abstract

This article examines the basis of Natural Language Understanding of transformer-based language models, such as BERT. It does this through a case study on idiom token classification. We use idiom token identification as a basis for our analysis because of the variety of information types that have previously been explored in the literature for this task, including topic, lexical, and syntactic features. This variety of relevant information types means that the task of idiom token identification enables us to explore the forms of linguistic information that a BERT language model captures and encodes in its representations. The core of this article presents three experiments. The first experiment analyzes the effectiveness of BERT sentence embeddings for creating a general idiom token identification model, and the results indicate that the BERT sentence embeddings outperform Skip-Thought. In the second and third experiments, we use the game theory concept of Shapley Values to rank the usefulness of individual idiomatic expressions for model training and use this ranking to analyze the type of information that the model finds useful. We find that a combination of idiom-intrinsic and topic-based properties contributes to an expression's usefulness in idiom token identification. Overall, our results indicate that BERT efficiently encodes a variety of information types, ranging from topic to lexical and syntactic information. Based on these results, we argue that, notwithstanding recent criticisms of language-model-based semantics, the ability of BERT to efficiently encode a variety of linguistic information types does represent a significant step forward in natural language understanding.

DOI

https://doi.org/10.3389/frai.2022.813967

Recommended Citation

Nedumpozhimana V, Klubička F and Kelleher JD (2022) Shapley Idioms: Analysing BERT Sentence Embeddings for General Idiom Token Identification. Front. Artif. Intell. 5:813967. doi: 10.3389/frai.2022.813967

Funder

Science Foundation Ireland

Included in

Computer Sciences Commons
