Document Type



This item is available under a Creative Commons License for non-commercial use only


Computer Sciences

Publication Details

Dissertation submitted in partial fulfilment of the requirements of Technological University Dublin for the degree of M.Sc. in Computing (Stream), 2018.


Predicting the risk score of new and potential customers is used across the financial industry. By implementing the prediction of risk scores for their customers a chit fund company can improve the knowledge and customer understanding without relying on human knowledge. Data is collected on each customer before they have taken out credit and during the time they contribute to a chit fund. Having collected the necessary data, the company can then decide whether modelling customer risk would benefit them. As the data is available historically, one aspect of risk score prediction will be the focus of this thesis, supervised machine learning. Supervised machine learning techniques use historic data to ‘learn a model of the relationship between a set of descriptive features and a target feature’ (Kelleher, Mac Namee, & D’Arcy, 2015). There are many supervised machine learning techniques; support vector machine (SVM), logistic regression and decision trees will be the focal point of this thesis. The main objective of this project attempts to predict a risk score for new or potential subscribers of a chit fund company. The models generated would be suitable for use before a customer joins a chit fund group as well as while the customer is taking part in the group, measuring risk before becoming a subscriber and the behavioural risk while with the company. The objective is to extend research already carried out to predict a score from zero to one identifying the probability of default. Default, for the purpose of this project, is defined as being more than 90 days late with a payment. The data of real chit fund subscribers was used to train and test the models built for the project. A factor reduction technique was used to identify key variables, and multiple models were tested to determine which gives the best results. The second objective of this project will look at the subscriber network. This section of the project will check for links between subscribers, and investigate a possible link between subscribers and their chance of default. Variables such as address and nominee will be the focus in this section. iii The most successful supervised machine learning model was the random forest model with precision of 59% and recall of 92%. Accuracy for this model was the highest of each of the models in the experiment at 85%. However, this is not the most trustworthy evaluation measure for this project as the dataset is unbalanced. A combination of 300 decision trees were applied in this model. Using the classification method, the class that was predicted by the majority of trees was selected as the final prediction. This achieved high accuracy of the dataset from the chit fund company, Kyepot. Social network analysis found that there was no unusual relationship between subscribers that went into default with regards to the area in which they live or their nominees. Supervised machine learning techniques have been shown to be a useful tool in the financial industry. This project suggests that these techniques may also be useful tools for chit fund companies. This project evaluates four different techniques suggesting the random forest technique is the most useful for this chit fund company.