RNTI

Sequence covering similarity for sequential and textual data mining
In EGC 2019, vol. RNTI-E-35, pp.105-116
Abstract
This paper introduces the sequence covering similarity, which we formally define to evaluate the similarity between a symbolic sequence (a string) and a set of symbolic sequences (a set of strings). From this covering similarity we derive a pairwise distance for comparing two symbolic sequences, and we show that this covering distance is a semi-metric. Examples illustrate how this string semi-metric, computable in O(n · log(n)), compares with the Levenshtein distance, which is in O(n²). A first toy experiment describes an application to plagiarism detection. Furthermore, from the definition of the covering similarity, we detail a discriminative model for sequential data classification. As a preliminary study, we evaluate this model on two benchmarks: the first is a nucleotide sequence classification task, the second a textual data classification task. On these tasks, the results obtained by the proposed method are competitive with the state of the art, including deep learning approaches.
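The abstract does not state the formal definitions, so the following is only an illustrative sketch under stated assumptions, not the paper's algorithm. It assumes that a covering of a non-empty string s by a set of strings is a split of s into contiguous pieces, each of which is a substring of some string in the set or a single symbol, and that the similarity is normalized as (|s| - k + 1) / |s|, where k is the smallest number of pieces. The function names, the normalization, and the symmetrization used for the distance are assumptions made for this example. The naive greedy search below does not achieve the O(n · log(n)) bound quoted above.

```python
def covering_size(s, refs):
    """Fewest contiguous pieces of s, each a substring of some string in refs
    or a lone symbol (assumed definition), found by greedy longest match."""
    pieces, i = 0, 0
    while i < len(s):
        j = i + 1  # assumption: a single symbol is always an allowed piece
        # Extend the current piece while it is still a substring of a reference.
        while j < len(s) and any(s[i:j + 1] in r for r in refs):
            j += 1
        pieces += 1
        i = j
    return pieces


def covering_similarity(s, refs):
    # Assumed normalization: equals 1.0 when s is covered by a single piece.
    return (len(s) - covering_size(s, refs) + 1) / len(s)


def covering_distance(a, b):
    # Assumed symmetrization: one minus the mean of the two directed similarities.
    return 1.0 - 0.5 * (covering_similarity(a, [b]) + covering_similarity(b, [a]))
```

Greedy longest match is sufficient here because the set of substrings of the references is closed under taking suffixes, so extending each piece as far as possible never increases the number of pieces. For instance, under these assumptions "abcd" is covered by "abxcd" with the two pieces "ab" and "cd".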