Sequence covering similarity for sequential and textual data mining
Abstract
This paper introduces the sequence covering similarity, which we formally define to evaluate
the similarity between a symbolic sequence (a string) and a set of symbolic sequences
(a set of strings). From this covering similarity we derive a pair-wise distance to compare two
symbolic sequences. We show that this covering distance is a semi-metric. Some examples are
given to show how this string semi-metric, computable in O(n · log(n)), compares with the
Levenshtein distance, which is in O(n²). A first toy experiment describes an application to
plagiarism detection. Furthermore, from the covering similarity definition, we detail a discriminative model
to address sequential data classification. As a preliminary study, we evaluate this model on two
benchmarks: the first relates to a nucleotide sequence classification task, the second to a
textual data classification task. On the considered tasks, the results obtained by the proposed
method are quite competitive compared with the state of the art, including deep learning
approaches.
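
For reference, the quadratic baseline mentioned above can be sketched as follows. This is the textbook dynamic program for the Levenshtein distance, not the covering distance introduced in this paper; the function name `levenshtein` and the two-row memory layout are illustrative choices. The nested loops over both strings are what give the O(n²) cost for two strings of length n.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b via the standard dynamic program.

    Time is O(len(a) * len(b)); memory is O(min(len(a), len(b))).
    """
    if len(a) < len(b):
        # The distance is symmetric, so keep the shorter string in the inner loop.
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Every cell of the implicit table is visited exactly once, so the running time grows with the product of the two string lengths.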