Document Type

Theses, Masters


This item is available under a Creative Commons License for non-commercial use only


Computer Sciences

Publication Details

A dissertation submitted in partial fulfilment of the requirements of Technological University Dublin for the degree of M.Sc. in Computing (Data Analytics) January 2015.


Road traffic accidents are a significant cause of death worldwide, and there is a global focus on understanding contributory factors and implementing prevention strategies. Although accident statistics are steadily improving, effective prevention must be persistent, evidence-based and properly resourced. This research aimed to predict fatal traffic accidents from UK STATS19 accident data using C5.0 and CHAID decision trees and Bayes net classification models. Accidents were grouped as either fatal or non-fatal. Because fatal accidents are infrequent, the resulting class imbalance was addressed with data transformation and sampling techniques to improve the likelihood of predicting the fatal class. CHAID was used for supervised discretisation and proved effective in identifying homogeneous subgroups. SPSS Modeler was used for data preparation and model building. Model performance was evaluated using accuracy, recall, precision and ROC curves. The experiment design and data preparation approach predicted fatal accidents with high recall; however, significant misclassification of non-fatal accidents as fatal led to poor accuracy and precision. Boosting was subsequently tested and achieved some improvement in accuracy. Serious accidents were grouped as non-fatal in the initial data analysis but are likely to share characteristics with fatal accidents, so the models struggled to classify them correctly as non-fatal. Changing the experiment design to use fatal, serious and slight as separate target classes may improve the models' accuracy. Overall, the models succeeded in classifying fatal traffic accidents correctly, which was the original objective of the research. Interpreting the business rules, by ranking them and summarising them in a standard format, proved effective for understanding and comparing key predictors.
The contributory factors identified by the C5.0 and Bayes net models were consistent, with road surface and urban/rural identified as the strongest predictors in both. The experiment demonstrated that classification techniques can be used to predict infrequent events once sampling techniques are applied.
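
The trade-off described in the abstract, high recall alongside poor precision and accuracy on an imbalanced target, can be illustrated with a minimal Python sketch. This is not the SPSS Modeler workflow used in the dissertation. The function names (`undersample`, `evaluate`), the `fatal` field name and every count below are hypothetical and chosen only for illustration. Random undersampling is shown as one common sampling technique, which may differ from the one the author applied.

```python
import random


def undersample(rows, label_key="fatal", seed=0):
    """Randomly undersample the majority class down to the minority class size.

    Illustrative only: the dissertation does not state which sampling
    technique was used, so this is an assumed stand-in.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def evaluate(y_true, y_pred):
    """Compute accuracy, recall and precision for a binary target (1 = fatal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical data: 1,000 accidents, of which 10 are fatal.
y_true = [1] * 10 + [0] * 990
# Hypothetical predictions: 9 of 10 fatals caught, 300 non-fatals flagged as fatal.
y_pred = [1] * 9 + [0] * 1 + [1] * 300 + [0] * 690

metrics = evaluate(y_true, y_pred)
print(metrics)

# Balancing the classes before training: 10 fatal rows plus 10 sampled non-fatal rows.
rows = [{"fatal": label} for label in y_true]
balanced = undersample(rows)
print(len(balanced), sum(r["fatal"] for r in balanced))
```

With these made-up counts, recall is 0.90 while precision is roughly 0.03 and accuracy roughly 0.70: almost every fatal accident is found, but most accidents flagged as fatal are not, which is the pattern the abstract reports.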