Sélection de mesures de similarité pour la classification de données catégorielles [Selecting similarity measures for clustering categorical data]

Amedeo Napoli, Miguel Couceiro, Guilherme Alves

In EGC 2020, vol. RNTI-E-36, pp. 325-332

Abstract

Data clustering is a well-known task in data mining that often relies on distances or, in some cases, similarity measures. The latter is the case for real-world datasets that comprise categorical attributes. Several similarity measures have been proposed in the literature; however, the choice among them depends on the context and the dataset at hand. In this paper, we address the following question: given a set of measures, which one is best suited for clustering a particular dataset? We propose an approach to automate this choice and present an empirical study on categorical datasets in which we evaluate the proposed approach.
