Document Type
Dissertation
Rights
This item is available under a Creative Commons License for non-commercial use only
Disciplines
Computer Sciences
Abstract
Spam persists as one of the biggest problems facing users and suppliers of email communication services. Machine learning techniques have been very successful at preventing many spam mails from arriving in user mailboxes; however, spam still accounts for over 50% of all emails sent. Despite this relative success, the economic cost of spam has been estimated at as much as $50 billion in 2005 and, more recently, at $20 billion, so spam remains a considerable problem. In essence, a spam email is a commercial communication that tries to entice the receiver to take some positive action. This project takes the text of emails and creates personality insight and language tone scores using IBM Watson's Tone Analyzer API. Those scores are used to investigate whether the language used in emails can be transformed into features that correctly classify them as spam or genuine emails. A range of machine learning techniques is applied in the course of this investigation. In a model using only the personality insight and language tone features, results were promising on one dataset but inconclusive across all datasets. Furthermore, in a model where these features were combined with a normalised term-frequency feature set, no real improvement in classification performance was shown.
DOI
https://doi.org/10.21427/D7WK7S
Recommended Citation
McGetrick, C. (2017) Investigation into the Application of Personality Insights and Language Tone Analysis in Spam Classification. doi:10.21427/D7WK7S
Publication Details
A dissertation submitted in partial fulfilment of the requirements of Technological University Dublin for the degree of M.Sc. in Computing (Data Analytics), April 2017