Author ORCID Identifier

0000-0003-1208-6991

Document Type

Article

Disciplines

Statistics

Publication Details

Statistical and Machine Learning: Methods and Applications (SAML-25) on June 5th and 6th, 2025 at TU Dublin, Ireland.

doi:10.21427/rsm9-tt41

Abstract

Clustering is a common unsupervised task in data analysis and machine learning. It seeks groups of objects such that similarity is highest within each cluster and dissimilarity is highest between clusters. One of the most widely used clustering algorithms is Partitioning Around Medoids (PAM), also known as k-medoids [4, 5]. The algorithm requires each cluster center (medoid) to be one of the data points, and it searches for the medoids that minimize the sum of the dissimilarities between each object and its assigned medoid. One recognized property of the method is its robustness to outliers, which stems from minimizing the total dissimilarity to other points. This clustering technique was designed for linear data. In applications, however, observations often cannot be represented in a Euclidean space and instead lie on more complex manifolds, in which case suitable variations are needed. In particular, this work extends the PAM algorithm to data consisting of both linear and circular variables, that is, to data lying on the surface of a cylinder. Clustering such data is challenging because of the complexity of the product space. Cylindrical data are described by a circular and a linear component measured on different scales. The most important difference between the two components is that values on a circular scale are periodic, that is, 0° and 360° represent the same value, whereas on a linear scale the values 0 and 360 are located in different places. Any similarity or distance measure must properly account for this: standard clustering methods cannot be applied directly to these mixed data types. Cylindrical data appear in many fields and have a wide range of applications. For instance, the HSV (Hue, Saturation, Value) color space used in color image processing has a periodic hue component [2].
In environmental sciences, meteorological data often combine wind direction with wind speed, temperature, SO2 concentration, or other air quality indicators [3]. In fire ecology, the fire perimeter orientation can be treated as a two-dimensional or a three-dimensional observation, yielding circular or spherical data. Combined with the fire size, this leads to cylindrical data [1]. The proposed cylindrical PAM method adopts a similarity measure derived from a probabilistic model. The angular component is assumed to be drawn from a von Mises distribution, while the linear components are drawn from a Gaussian distribution. The performance of the proposed cylindrical PAM is evaluated on numerical and real-data examples. A comparison is also provided to demonstrate its effectiveness in handling mixed data types.
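
To make the idea of a periodicity-aware dissimilarity concrete, the following is a minimal Python sketch, not the method of the paper. It assumes a dissimilarity shaped like the negative log-kernels of a von Mises and a Gaussian density, kappa * (1 - cos(dtheta)) + dx^2 / (2 * sigma^2), and it uses a simple alternating k-medoids update rather than the original BUILD/SWAP steps of PAM. The function names and the parameters `kappa`, `sigma`, and `init` are hypothetical choices made for illustration.

```python
import numpy as np

def cylindrical_dissimilarity(theta, x, kappa=1.0, sigma=1.0):
    """Pairwise dissimilarity for points (theta_i, x_i) on a cylinder.

    theta: angles in radians, shape (n,); x: linear values, shape (n,).
    kappa and sigma are assumed weights, not values from the paper.
    """
    dtheta = theta[:, None] - theta[None, :]
    dx = x[:, None] - x[None, :]
    # 1 - cos(.) is periodic, so angles 0 and 2*pi are at distance zero.
    return kappa * (1.0 - np.cos(dtheta)) + dx**2 / (2.0 * sigma**2)

def k_medoids(D, k, init=None, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed dissimilarity matrix D."""
    n = D.shape[0]
    if init is None:
        rng = np.random.default_rng(seed)
        medoids = rng.choice(n, size=k, replace=False)
    else:
        medoids = np.asarray(init, dtype=int)
    for _ in range(max_iter):
        # Assignment step: each point joins its closest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # Update step: the member with the smallest total
            # dissimilarity to its own cluster becomes the medoid.
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy data: one group straddles 0 degrees, the other sits near 180 degrees.
theta = np.deg2rad([350, 355, 0, 5, 10, 170, 175, 180, 185, 190])
x = np.array([0.0, 0.1, -0.1, 0.05, 0.0, 1.0, 1.1, 0.9, 1.05, 1.0])
D = cylindrical_dissimilarity(theta, x)
medoids, labels = k_medoids(D, k=2, init=[2, 7])
print(medoids, labels)
```

In this toy example the angles 350° and 10° end up in the same cluster because the angular term depends only on the cosine of the angle difference. A Euclidean distance applied to the raw angle values would instead place them at opposite ends of the scale.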

DOI

https://doi.org/10.21427/rsm9-tt41

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

