Document Type

Conference Paper

Disciplines

1.2 COMPUTER AND INFORMATION SCIENCE, Information Science

Publication Details

https://dl.acm.org/doi/abs/10.1145/3555776.3577856

Makhmutova, L., Ross, R. & Salton, G. (2023). Impact of Character n-grams Attention Scores for English and Russian News Articles Authorship Attribution. SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, Tallinn, Estonia, March 27 - 31.

ISBN: 978-1-4503-9517-5

Abstract

Language embeddings are often used as black-box, word-level tools that provide powerful language analysis across many tasks. Yet for many tasks, such as authorship attribution, access to feature-level information on character n-grams can provide insights that help with model refinement and development. In this paper, we investigate and evaluate the importance of character n-grams within an embeddings context in authorship attribution through the use of attention scores. We perform this investigation for both English (Reuters_50_50) and Russian (Taiga) news authorship datasets. Our analysis shows that attention scores are higher for character n-grams that humans consider important for authorship identification. Beyond specific benefits in authorship attribution, this work provides insights into the importance of character n-grams as a unit within embeddings.

DOI

https://doi.org/10.1145/3555776.3577856

Funder

SFI through ML-Labs 18/CRT/6183

Creative Commons License

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

