This item is available under a Creative Commons License for non-commercial use only


Computer Sciences

Publication Details

A dissertation submitted in partial fulfilment of the requirements of Technological University Dublin for the degree of M.Sc. in Computing (Data Analytics), 28 September 2020.


This dissertation proposes LightGWAS, a novel machine learning procedure for genome-wide association studies (GWAS) based on LightGBM and k-fold cross-validation. The literature review found that currently available GWAS implementations rely on extensive manual quality control steps to address statistical issues, such as controlling false-positive inflation and loss of power. It also showed that they require a specific GWAS method for each type of genomic dataset morphology, which increases dependence on human intervention and leaves room for error. LightGWAS is proposed as a potential single, resilient, autonomous and scalable solution to these concerns. In this research, LightGWAS was contrasted with the current state of the art for GWAS through secondary research, and compared with a GWAS implementation based on a general linear model (GLM) with support for Firth regularisation. Quantitative empirical tests and deductive reasoning were employed to reach and evaluate the results. The models were applied to balanced (case:control = 1:1), imbalanced (case:control = 1:10), and highly imbalanced (case:control = 1:100) genomic datasets with binary phenotypes. The statistical tests indicated that LightGWAS performs equivalently to the GLM method on balanced datasets and outperforms it on imbalanced and highly imbalanced datasets. The metrics assessed were the F1 score (the harmonic mean of precision and recall), recall, average precision score (APS), area under the receiver operating characteristic curve (ROC AUC), accuracy, and precision.
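To illustrate why precision, recall and F1 are reported alongside accuracy for imbalanced data, the following minimal sketch computes these metrics from scratch on a toy case:control = 1:100 label vector. It is not the dissertation's evaluation code; the function name `binary_metrics`, the label vectors and both sets of predictions are hypothetical and chosen only for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged as cases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls correctly rejected

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical highly imbalanced labels: 10 cases, 1000 controls (1:100).
y_true = [1] * 10 + [0] * 1000

# Trivial baseline (assumption): predict "control" for every sample.
always_control = [0] * 1010

# Hypothetical classifier: finds 8 of 10 cases, with 4 false positives.
some_model = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 996

baseline = binary_metrics(y_true, always_control)
model = binary_metrics(y_true, some_model)

print("baseline:", baseline)
print("model:   ", model)
```

In this toy setting the trivial baseline reaches about 99% accuracy while detecting no cases at all (recall and F1 of zero), which is why accuracy alone says little about performance on the 1:10 and 1:100 scenarios.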