Conference papers

Presenting a Labelled Dataset for Real-Time Detection of Abusive User Posts

Hao Chen, Technological University Dublin
Susan McKeever, Technological University Dublin
Sarah Jane Delany, Technological University Dublin

Document Type

Conference Paper

Rights

Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence

Disciplines

1.2 COMPUTER AND INFORMATION SCIENCE

Publication Details

WI'17: International Conference on Web Intelligence, Leipzig, Germany, August 23-26, 2017

doi:10.1145/3106426.3106456

Abstract

Social media sites facilitate users in posting their own personal comments online. Most support free-format user posting, with close to real-time publishing speeds. However, online posts generated by a public user audience carry the risk of containing inappropriate, potentially abusive content. To detect such content, the straightforward approach is to filter against blacklists of profane terms. However, this lexicon filtering approach is prone to problems around word variations and lack of context. Although recent methods inspired by machine learning have boosted detection accuracies, the lack of gold-standard labelled datasets limits the development of this approach. In this work, we present a dataset of user comments, using crowdsourcing for labelling. Since abusive content can be ambiguous and subjective to the individual reader, we propose an aggregated mechanism for assessing different opinions from different labellers. In addition, instead of the typical binary categories of abusive or not, we introduce a third class of ‘undecided’ to capture the real-life scenario of instances that are neither blatantly abusive nor clearly harmless. We have performed preliminary experiments on this dataset using best-practice techniques in text classification. Finally, we have evaluated the detection performance of various feature groups, namely syntactic, semantic and context-based features. Results show these features can increase our classifier performance by 18% in detection of abusive content.

DOI

https://doi.org/10.1145/3106426.3106456

Recommended Citation

Chen, H., McKeever, S. & Delany, S.J. (2017). Presenting a labelled dataset for real-time detection of abusive user posts. WI'17: Proceedings of the International Conference on Web Intelligence, Leipzig, Germany, August 23-26. doi:10.1145/3106426.3106456

Included in

Artificial Intelligence and Robotics Commons
