Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence
Computer Sciences, Information Science
Clustering is a fundamental machine learning task that partitions data into homogeneous groups. K-means and its variants are the most widely used class of clustering algorithms today. However, the original k-means algorithm can only be applied to numeric data. Categorical data must first be converted to numeric form through 1-of-K coding, which introduces problems of its own. K-prototypes, another clustering algorithm derived from k-means, can handle categorical data directly by adopting a different notion of distance. In this paper, we systematically compare these two methods through an experimental analysis. Our analysis shows that k-prototypes is better suited to large-scale datasets, while the performance of k-means with 1-of-K coding is more stable. We believe these are useful heuristics for clustering methods working with highly categorical data.
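As a rough illustration of the two approaches the abstract contrasts, the sketch below encodes a categorical attribute with 1-of-K coding and, separately, computes a mixed dissimilarity of the kind commonly associated with k-prototypes: squared Euclidean distance on numeric attributes plus a weighted count of mismatches on categorical attributes. The function names, the toy data, and the weight `gamma` are illustrative assumptions and are not taken from the paper.

```python
def one_hot(values):
    """1-of-K coding: map each category to a binary indicator vector.

    Categories are ordered alphabetically (an arbitrary choice for this sketch).
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]


def squared_euclidean(a, b):
    """Squared Euclidean distance, as used by k-means on numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matching_dissimilarity(a, b):
    """Number of categorical attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Numeric part plus gamma times the categorical mismatch count.

    gamma is a hypothetical weight balancing the two attribute types.
    """
    return (squared_euclidean(num_a, num_b)
            + gamma * matching_dissimilarity(cat_a, cat_b))


# Toy data (made up for illustration).
colours = ["red", "green", "blue", "red"]
encoded = one_hot(colours)

# After 1-of-K coding, any two distinct categories are equally far apart.
d_red_green = squared_euclidean(encoded[0], encoded[1])
d_red_red = squared_euclidean(encoded[0], encoded[3])

# Mixed dissimilarity on records with numeric and categorical parts.
d_mixed = mixed_dissimilarity([1.0, 2.0], ["red", "small"],
                              [2.0, 4.0], ["green", "small"], gamma=0.5)
```

Note that 1-of-K coding adds one dimension per category value, so highly categorical data becomes wide and sparse, whereas the mismatch count works on the original attributes without expanding them.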
Wang, F., Franco, H., Pugh, J. and Ross, R. (2016) Empirical Comparative Analysis of 1-of-K Coding and K-Prototypes in Categorical Clustering. Irish Conference on Artificial Intelligence and Cognitive Science (AICS 2016), September 20-21 2016, University College Dublin. doi:10.21427/em6q-2787