Generation of clustered binary data with controlled partitioning and evaluation of the impact of dimensionality reduction methods on this partitioning
Abstract
Binary data, that is, data whose variables take one of two possible values, are widely
used in research areas such as protein modelling in bioinformatics. Some problems involve
clustering binary data. Real data for studying whether an algorithm is applicable to
a given problem are not always available. This difficulty is even more pronounced in
unsupervised learning, and in clustering problems in particular. To address it, this paper
proposes a new algorithm for generating clustered binary data. The algorithm
is controlled by several parameters, which make it possible to generate
data with known characteristics and controlled clusters. This article
details the generation method and presents a comparison of
dimensionality reduction algorithms, showing that the generated data
can help in choosing a dimensionality reduction algorithm that preserves cluster
separability.