Symbol identification in unstructured documents
Abstract
We describe an approach for efficiently extracting graphical symbols from a vector file (such as
a PDF). The 2D graphic objects are first mapped to a one-dimensional string of codes; identifying
the symbols then amounts to searching for repeated sub-sequences of codes in the input file.
Existing work in the literature relies on suffix trees or suffix arrays. Our algorithm instead
uses bucket sort to identify the repetitions. The size and frequency of the repetitions sought
are specified by the end user.
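
As an illustration of the idea only, the following minimal sketch looks for repeated sub-sequences in a code string by bucketing start positions one code at a time. It is not the authors' implementation: the function name `repeated_subsequences`, the parameters `size` and `min_freq`, and the reading of "size" as a fixed window length and "frequency" as a minimum occurrence count are all assumptions made for this example.

```python
from collections import defaultdict


def repeated_subsequences(codes, size, min_freq):
    """Hypothetical helper: map each window of `size` codes that occurs
    at least `min_freq` times to the list of its start positions."""
    if size <= 0 or size > len(codes):
        return {}
    # Start with one bucket holding every valid start position.
    buckets = [list(range(len(codes) - size + 1))]
    # Refine the buckets one code at a time, in the manner of a bucket
    # (radix) sort: positions stay together only while their codes agree.
    for offset in range(size):
        refined = []
        for bucket in buckets:
            split = defaultdict(list)
            for start in bucket:
                split[codes[start + offset]].append(start)
            # Buckets that are already too small can never reach min_freq.
            refined.extend(b for b in split.values() if len(b) >= min_freq)
        buckets = refined
    # Each surviving bucket groups the occurrences of one sub-sequence.
    return {tuple(codes[b[0]:b[0] + size]): b for b in buckets}


if __name__ == "__main__":
    print(repeated_subsequences(list("ABCABCXAB"), size=2, min_freq=2))
```

In this sketch, occurrences are allowed to overlap, and buckets are pruned as soon as they fall below the requested frequency, so the work shrinks quickly when few sub-sequences repeat.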