Data Cleaning Guided by Inter-Column Semantics
Abstract
The volume of unstructured and heterogeneous data is growing rapidly, and this data comes
from multiple sources of varying quality. As a result, data is often manipulated
without knowledge of its structure or semantics, because the metadata may be
insufficient or entirely absent. Data anomalies may stem from poor semantic descriptions,
or from the absence of any description at all. We propose an approach for better
understanding the semantics and structure of the data. It helps correct intra-column anomalies
first (homogenization), and then inter-column anomalies caused by violations of semantic
dependencies.
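
To make the two steps concrete, the following is a minimal illustrative sketch and not the approach proposed in this work. It assumes a hypothetical table with a `city` and a `country` column, a hand-written lookup table for homogenization, and a simple majority vote to flag rows that break an assumed city-to-country dependency. All names (`homogenize`, `dependency_violations`, `canonical_cities`) are invented for the example.

```python
from collections import Counter, defaultdict

def homogenize(values, canonical):
    """Intra-column step: map each raw value to a canonical spelling.

    `canonical` is an assumed lookup table keyed by the lowercased,
    stripped value; unknown values are only stripped of whitespace.
    """
    return [canonical.get(v.strip().lower(), v.strip()) for v in values]

def dependency_violations(rows, lhs, rhs):
    """Inter-column step: flag rows that break the dependency lhs -> rhs.

    For each lhs value, the most frequent rhs value is taken as the
    expected one (a majority-vote heuristic chosen for illustration).
    Returns the indices of rows whose rhs value differs from it.
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    majority = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [i for i, row in enumerate(rows) if row[rhs] != majority[row[lhs]]]

# Hypothetical sample data with inconsistent spellings and one wrong country.
rows = [
    {"city": "paris", "country": "France"},
    {"city": "Paris ", "country": "France"},
    {"city": "PARIS", "country": "Italy"},
    {"city": "Rome", "country": "Italy"},
]
canonical_cities = {"paris": "Paris", "rome": "Rome"}

# Step 1: homogenize the city column.
cities = homogenize([r["city"] for r in rows], canonical_cities)
clean_rows = [{**r, "city": c} for r, c in zip(rows, cities)]

# Step 2: check the assumed dependency city -> country on the homogenized data.
violations = dependency_violations(clean_rows, "city", "country")
print(cities)
print(violations)
```

The order of the steps matters in this sketch: without homogenization, the three spellings of Paris would fall into separate groups and the inconsistent country value would go undetected.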