Identification de symboles dans des documents déstructurés

In EGC 2019, vol. RNTI-E-35, pp.297-302

Abstract

We describe an approach for efficiently extracting graphical symbols from a vector file (such as a PDF). The 2D graphic objects are first converted into a one-dimensional string of codes; identifying symbols then amounts to finding repeated subsequences of codes in that string. Existing work in the literature relies on suffix trees or suffix arrays. Our algorithm instead uses bucket sort to identify repetitions. The size and frequency of the repetitions to look for are specified by the end user.
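As a rough illustration of the idea of finding repeated code subsequences by bucketing, here is a minimal Python sketch. It is not the algorithm from the paper: it assumes the graphic objects have already been encoded as a sequence of codes, it groups fixed-length windows into hash-based buckets rather than performing a true bucket sort, and the function name `repeated_subsequences` and the parameters `size` and `min_count` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def repeated_subsequences(codes, size, min_count):
    """Map each window of `size` codes that occurs at least `min_count`
    times to the list of its start positions.

    Illustrative only: windows are grouped into dictionary buckets keyed
    by their content; this is an assumption, not the paper's method.
    """
    buckets = defaultdict(list)
    # Slide a window of length `size` over the code string and drop each
    # start position into the bucket for that window's content.
    for start in range(len(codes) - size + 1):
        key = tuple(codes[start:start + size])
        buckets[key].append(start)
    # Keep only the buckets that are frequent enough.
    return {key: pos for key, pos in buckets.items() if len(pos) >= min_count}


# Hypothetical code string: each letter stands for one encoded graphic object.
codes = list("LCALCBLCA")
result = repeated_subsequences(codes, size=3, min_count=2)
# result == {('L', 'C', 'A'): [0, 6]}
```

Here `size` and `min_count` stand in for the size and frequency chosen by the end user. Note that this sketch counts overlapping occurrences as separate matches.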
