1.2 COMPUTER AND INFORMATION SCIENCE, Information Science
Language embeddings are often used as black-box, word-level tools that provide powerful language analysis across many tasks. For tasks such as authorship attribution, however, feature-level information on character n-grams can provide insights that help with model refinement and development. In this paper we investigate and evaluate the importance of character n-grams within an embedding context in authorship attribution through the use of attention scores. We perform this investigation on both English (Reuters_50_50) and Russian (Taiga) news authorship datasets. Our analysis shows that attention scores are higher for character n-grams that humans consider important for authorship identification. Beyond specific benefits in authorship attribution, this work provides insights into the importance of character n-grams as a unit within embeddings.
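As a rough illustration of the idea in the abstract, the sketch below splits a string into character n-grams and ranks them by a softmax-normalised attention weight. It is not the authors' model: the function names (`char_ngrams`, `softmax`, `rank_ngrams`) are hypothetical, and the raw scores are made-up numbers standing in for whatever a trained attention layer would produce.

```python
import math


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_ngrams(text, raw_scores, n=3):
    """Pair each n-gram with its attention weight, highest weight first."""
    grams = char_ngrams(text, n)
    if len(grams) != len(raw_scores):
        raise ValueError("need exactly one raw score per n-gram")
    weights = softmax(raw_scores)
    return sorted(zip(grams, weights), key=lambda pair: pair[1], reverse=True)


# Hypothetical raw scores (assumption): a trained model would supply these.
example = rank_ngrams("said.", [0.1, 0.2, 2.0])
for gram, weight in example:
    print(repr(gram), round(weight, 3))
```

Inspecting the top-ranked n-grams in this way is one simple means of checking whether highly weighted units, such as punctuation-bearing n-grams, match what a human analyst would treat as a stylistic cue.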
Mukhmutova, Liliya; Ross, Robert J.; and Salton, Giancarlo, "Impact of Character n-grams Attention Scores for English and Russian News Articles Authorship Attribution" (2023). Conference papers. 13.
Funding: SFI through ML-Labs, grant 18/CRT/6183
Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.