Document Type

Theses, Ph.D


Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence




The application of data analytics to educational settings is an emerging and growing research area. Much of the work published to date is based on ever-increasing volumes of log data that are systematically gathered in virtual learning environments as part of module delivery. This thesis took a unique approach to modelling academic performance: it is a first study to model indicators of students at risk of failing in the first year of tertiary education based on data gathered before the first year commences, which facilitates early engagement with at-risk students.

The study was conducted over three years, from 2010 to 2012, and was based on a sample student population (n=1,207) aged between 18 and 60 from a range of academic disciplines. Data were extracted from two sources: student enrolment records maintained by college administration, and an online, self-reporting learner profiling tool developed specifically for this study. The profiling tool was administered during induction sessions for students enrolling into the first year of study. Twenty-four factors relating to prior academic performance, personality, motivation, self-regulation, learning approaches, learner modality, age and gender were considered.

Eight classification algorithms were evaluated. Cross-validation accuracies of models based on all participants were compared with the accuracies of models trained on the 2010 and 2011 student cohorts and tested on the 2012 student cohort. The best cross-validation accuracies were achieved by a Support Vector Machine (82%) and a Neural Network (75%). The k-Nearest Neighbour model, which has received little attention in educational data mining studies, achieved the highest accuracy when applied to the 2012 student cohort (72%), similar to its cross-validation accuracy (72%). Accuracies for other algorithms applied to the 2012 student cohort also compared favourably, for example Ensembles (71%), Support Vector Machine (70%) and a Decision Tree (70%).
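
The two evaluation schemes described above can be illustrated with a minimal sketch. It is not the code used in the thesis: the data are synthetic, the feature layout, the `year` field, the value of k and the number of folds are all assumptions made for illustration, and a plain k-Nearest Neighbour classifier stands in for the eight algorithms evaluated.

```python
import random
from collections import Counter


def knn_predict(train, query, k=5):
    """Return the majority label among the k training rows nearest to query."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row["x"], query)),
    )[:k]
    return Counter(row["y"] for row in nearest).most_common(1)[0][0]


def accuracy(train, test, k=5):
    """Fraction of test rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, row["x"], k) == row["y"] for row in test)
    return hits / len(test)


def cohort_holdout(rows, test_year, k=5):
    """Train on cohorts before test_year, test on the test_year cohort."""
    train = [r for r in rows if r["year"] < test_year]
    test = [r for r in rows if r["year"] == test_year]
    return accuracy(train, test, k)


def cross_validate(rows, folds=10, k=5, seed=0):
    """Mean accuracy over `folds` folds, ignoring cohort year."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    scores = []
    for i in range(folds):
        test = shuffled[i::folds]
        train = [r for j, r in enumerate(shuffled) if j % folds != i]
        scores.append(accuracy(train, test, k))
    return sum(scores) / folds


def synthetic_rows(n=300, seed=1):
    """Hypothetical data: two numeric factors, label 1 = at risk."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(2.0 * y, 1.0), rng.gauss(-2.0 * y, 1.0)]
        rows.append({"x": x, "y": y, "year": rng.choice([2010, 2011, 2012])})
    return rows


rows = synthetic_rows()
cv_acc = cross_validate(rows)
holdout_acc = cohort_holdout(rows, test_year=2012)
print(f"cross-validation accuracy: {cv_acc:.2f}")
print(f"2012 holdout accuracy:     {holdout_acc:.2f}")
```

Cross-validation mixes all cohorts, whereas the cohort holdout only ever predicts a later intake from earlier ones, so agreement between the two figures is one indication that a model generalises to a new intake.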

Models of subgroups by age and by academic discipline achieved higher accuracy than models of all participants; however, a larger sample size is needed to confirm these results. Progressive sampling showed that a sample size greater than 900 was required for model accuracy to converge.
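
Progressive sampling can be sketched as follows: a model is trained on progressively larger samples and scored on a fixed test set until the accuracy stops changing. This is an illustrative sketch only. The nearest-centroid classifier, the synthetic data, the sample-size schedule and the stopping rule (`tol`, `patience`) are assumptions, not the procedure used in the thesis.

```python
import random


def fit_centroids(rows):
    """Return the mean feature vector for each class label."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


def progressive_sampling(train, test, sizes, tol=0.01, patience=2):
    """Train on growing prefixes of `train`.

    Stops once accuracy has changed by less than `tol` for `patience`
    consecutive steps (an assumed convergence rule).
    Returns the learning curve as a list of (sample_size, accuracy).
    """
    curve, stable = [], 0
    for n in sizes:
        acc = accuracy(fit_centroids(train[:n]), test)
        if curve and abs(acc - curve[-1][1]) < tol:
            stable += 1
        else:
            stable = 0
        curve.append((n, acc))
        if stable >= patience:
            break
    return curve


def synthetic(n, rng):
    """Hypothetical data: two noisy numeric factors, label 1 = at risk."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append(([rng.gauss(y, 1.5), rng.gauss(-y, 1.5)], y))
    return rows


rng = random.Random(0)
train, test = synthetic(1000, rng), synthetic(300, rng)
curve = progressive_sampling(train, test, sizes=range(100, 1001, 100))
for n, acc in curve:
    print(f"n={n:4d}  accuracy={acc:.3f}")
```

The sample size at which the loop stops is the convergence point for this particular classifier and tolerance; a stricter `tol` or larger `patience` would report a larger required sample.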

Results showed that the factors most predictive of academic performance in the first year of tertiary education included age, prior academic performance and self-efficacy. Kinaesthetic modality was also indicative of students at risk of failing; this factor has not previously been cited as a significant predictor of academic performance.

Models reported in this study show that learner profiling completed prior to commencement of the first year of study yielded informative and generalisable results that identified students at risk of failing. Additionally, model accuracies were comparable to those of models reported elsewhere that included data collected from student activity in semester one, confirming the validity of early student profiling.