Author ORCID Identifier
Companies have an increasing access to very large datasets within their domain. Analysing these datasets often requires the application of feature selection techniques in order to reduce the dimensionality of the data and prioritise features for downstream knowledge generation tasks. Effective feature selection is a key part of clustering, regression and classification. It presents a myriad of opportunities to improve the machine learning pipeline: eliminating redundant and irrelevant features, reducing model over-fitting, faster model training times and more explainable models. By contrast, and despite the widespread availability and use of categorical data in practice, feature selection for categorical and/or mixed data has received relatively little attention in comparison to numerical data. Furthermore, existing feature selection methods for mixed data are sensitive to number of objects by having nonlinear time complexities with respect to number of objects. In this work, we propose a generic multiple association measure for mixed datasets and a novel feature selection algorithm that uses multiple association across features. Our algorithm is based upon the belief that the most representative chosen set of features should be as diverse and minimally dependent on each other as possible. The proposed algorithm formulates the problem of feature
selection as an optimization problem, searching for the set of features that have minimum association amongst them. We present a generic multiple association measure and two associated feature selection algorithms: Naive and Greedy Feature Selection Algorithms called NFSA and GFSA, respectively. Our proposed GFSA algorithm is evaluated on 15 benchmark datasets, and compared to four existing state of the art feature selection techniques. We demonstrate that our approach provides comparable downstream classification performance outperforming other leading techniques on several datasets. Both time complexity analysis and experimental results show that our proposed algorithm significantly reduces the processing time required for unsupervised feature selection algorithms especially for long datasets which have a huge number of objects, whilst also yielding comparable clustering and classification performance. On the other hand, we do not recommend our approach for wide datasets where the number of features is huge with respect to the number of objects e.g., image, text and genome datasets.
Ayman Taha, Ali S. Hadi, Bernard Cosgrave, Susan McKeever, A multiple association-based unsupervised feature selection algorithm for mixed data sets, Expert Systems with Applications, Volume 212, 2023, 118718, ISSN 0957-4174, DOI: 10.1016/j.eswa.2022.118718.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Available for download on Saturday, February 01, 2025