Similarité par recouvrement de séquences pour la fouille de données séquentielles et textuelles

Pierre-François Marteau, Nicolas Béchet, Oussama Ahmia

In EGC 2019, vol. RNTI-E-35, pp.105-116

Abstract

This paper introduces the sequence covering similarity, which we formally define to evaluate the similarity between a symbolic sequence (a string) and a set of symbolic sequences (a set of strings). From this covering similarity we derive a pairwise distance to compare two symbolic sequences. We show that this covering distance is a semi-metric. Some examples are given to show how this string semi-metric, computable in O(n · log(n)), compares with the Levenshtein distance, which is in O(n²). A first toy experiment describes an application to plagiarism detection. Furthermore, from the covering similarity definition, we detail a discriminative model to address sequential data classification. As a preliminary study, we evaluate this model on two benchmarks: the first relates to a nucleotide sequence classification task, the second to a textual data classification task. On the considered tasks, the results obtained by the proposed method are competitive with the state of the art, including deep learning approaches.
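For readers unfamiliar with the quadratic baseline mentioned in the abstract, the following is a minimal sketch of the standard dynamic-programming Levenshtein distance. It is not taken from the paper and does not implement the covering similarity itself; the function name `levenshtein` is chosen here for illustration only. It shows where the O(n²) cost comes from: every pair of positions in the two strings is visited once.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program.

    Runs in O(len(a) * len(b)) time, i.e. quadratic for strings of
    comparable length, and O(len(b)) extra memory.
    """
    # prev[j] holds the distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # curr[0] is the cost of deleting the first i characters of a.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]
```

The nested loop over both strings is what a sub-quadratic similarity such as the one proposed in the paper aims to avoid.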
