

Computer Sciences

Publication Details

Authors' accepted manuscript for Expert Systems with Applications

Companies have increasing access to very large datasets within their domains. Analysing these datasets often requires feature selection to reduce the dimensionality of the data and to prioritise features for downstream knowledge generation tasks. Effective feature selection is a key part of clustering, regression and classification, and it offers many opportunities to improve the machine learning pipeline: eliminating redundant and irrelevant features, reducing model over-fitting, shortening model training times and producing more explainable models. However, despite the widespread availability and use of categorical data in practice, feature selection for categorical and/or mixed data has received relatively little attention compared with numerical data. Furthermore, existing feature selection methods for mixed data are sensitive to the number of objects, because their time complexity is nonlinear in that number. In this work, we propose a generic multiple association measure for mixed datasets and a novel feature selection algorithm that uses multiple association across features. Our algorithm is based on the premise that the most representative set of features should be as diverse, and as minimally dependent on each other, as possible. It formulates feature selection as an optimization problem, searching for the set of features with minimum association amongst them. We present two feature selection algorithms built on this measure: the Naive and Greedy Feature Selection Algorithms, called NFSA and GFSA, respectively. GFSA is evaluated on 15 benchmark datasets and compared to four existing state-of-the-art feature selection techniques. We demonstrate that our approach provides comparable downstream classification performance, outperforming other leading techniques on several datasets. Both time complexity analysis and experimental results show that our proposed algorithm significantly reduces the processing time required for unsupervised feature selection, especially for long datasets with a huge number of objects, whilst also yielding comparable clustering and classification performance. On the other hand, we do not recommend our approach for wide datasets, where the number of features is huge relative to the number of objects, e.g., image, text and genome datasets.
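The idea of selecting features with minimum association amongst them can be illustrated with a small sketch. This is not the authors' GFSA or their multiple association measure, neither of which is specified in the abstract. As a stand-in, the sketch assumes purely categorical columns, uses pairwise Cramér's V as the association score, and applies a simple greedy rule (add the feature with the smallest summed association to those already chosen). The names `cramers_v` and `greedy_select` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V for two equal-length categorical sequences.

    Stand-in association score (an assumption, not the paper's measure):
    0 means no association, 1 means one column determines the other.
    """
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    k = min(len(px), len(py)) - 1
    if k == 0:
        return 0.0  # a constant column has no measurable association
    chi2 = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def greedy_select(columns, k):
    """Greedily pick k feature names from a dict of name -> list of values.

    Illustrative rule only: seed with the feature that has the lowest total
    association with all others, then repeatedly add the feature with the
    lowest summed association to the features already selected.
    """
    names = list(columns)
    assoc = {
        (a, b): cramers_v(columns[a], columns[b])
        for a in names
        for b in names
        if a != b
    }
    seed = min(names, key=lambda a: sum(assoc[a, b] for b in names if b != a))
    selected = [seed]
    while len(selected) < k:
        remaining = [a for a in names if a not in selected]
        best = min(remaining, key=lambda a: sum(assoc[a, s] for s in selected))
        selected.append(best)
    return selected
```

Given a dataset in which one column duplicates another, this rule avoids selecting both copies, which is the diversity property the abstract describes. Numerical features would first need discretising or a different association score, a step this sketch leaves out.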



Horizon 2020

Available for download on Saturday, February 01, 2025
