Document Type

Conference Paper

Publication Details

https://doi.org/10.1145/3677117.3685006

Abstract

The proliferation of hate speech on digital platforms has become a significant issue, and automated content moderation systems built on machine learning are a proposed solution. However, these systems face challenges in multilingual and low-resource settings because they need extensive labelled data. This paper introduces an explainable AI framework designed to identify annotation discrepancies in low-resource languages, focusing on Hindi, the third most-spoken language worldwide, which lacks comprehensive research in hate speech detection. By examining the labelling quality of the Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) challenge, we use unsupervised learning methods to extract topical variations and annotation behavior, and apply these features in an explainable AI-based classification model, TabNet. We release a relabelled Hindi hate speech benchmark dataset with label-flipping information and related metadata to facilitate research in this area. The source code has also been released for reproducibility purposes. Please be advised that this work contains examples of toxic content.

DOI

10.1145/3677117.3685006

Creative Commons License

Creative Commons Attribution 4.0 International License

