Non-disjoint grouping of text documents based on Word Sequence Kernel
Abstract
This paper addresses two issues in text clustering: the detection of non-disjoint groups and the representation of textual data. A text document can discuss several themes and should therefore belong to several groups, so the learning algorithm must be able to produce non-disjoint clusters and assign each document to more than one cluster. The second issue concerns data representation. Textual data are often represented as a bag of features such as terms, phrases or concepts. This representation ignores correlations between terms and gives no importance to the order of words in the text. We propose an unsupervised learning method that detects overlapping groups in text documents by treating each text as a sequence of words and using the Word Sequence Kernel as the similarity measure. Experiments show that the proposed method outperforms existing overlapping clustering methods based on the bag-of-words representation in terms of clustering accuracy, and that it detects more relevant groups in textual documents.
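To make the similarity measure concrete, the following is a minimal sketch of a word-level gap-weighted subsequence kernel, which is the usual formulation behind the Word Sequence Kernel. The abstract does not give the exact formulation or parameters used in the paper, so the function names `wsk` and `wsk_normalized`, the subsequence length `n`, the decay factor `lam`, and the whitespace tokenization are all assumptions made for illustration.

```python
from math import sqrt


def wsk(s, t, n=2, lam=0.5):
    """Gap-weighted subsequence kernel of order n over two word lists.

    Illustrative sketch only: n (subsequence length) and lam (gap decay,
    0 < lam <= 1) are assumed parameters, not values from the paper.
    """
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary kernel K'_i on prefixes s[:a], t[:b].
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0  # K'_0 is 1 for every pair of prefixes
    for i in range(1, n):
        for a in range(1, ls + 1):
            word = s[a - 1]
            for b in range(1, lt + 1):
                # Extend s by one word: decay the previous value, then add
                # every position j in t[:b] where the same word occurs.
                total = lam * kp[i][a - 1][b]
                for j in range(1, b + 1):
                    if t[j - 1] == word:
                        total += kp[i - 1][a - 1][j - 1] * lam ** (b - j + 2)
                kp[i][a][b] = total
    # Final kernel: close each subsequence of length n at a matching word pair.
    k = 0.0
    for a in range(1, ls + 1):
        for j in range(1, lt + 1):
            if t[j - 1] == s[a - 1]:
                k += kp[n - 1][a - 1][j - 1] * lam ** 2
    return k


def wsk_normalized(s, t, n=2, lam=0.5):
    """Kernel value scaled to [0, 1] so that document length has less effect."""
    denom = sqrt(wsk(s, s, n, lam) * wsk(t, t, n, lam))
    return wsk(s, t, n, lam) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    d1 = "the cat sat on the mat".split()
    d2 = "the cat lay on the mat".split()
    d3 = "mat the on sat cat the".split()  # same words as d1, reversed order
    print(wsk_normalized(d1, d2))
    print(wsk_normalized(d1, d3))
```

Unlike a bag-of-words similarity, this measure depends on word order: `d1` and `d3` contain exactly the same words, yet they receive a lower score than `d1` and `d2` because fewer ordered word pairs are shared. A matrix of such normalized values could then serve as the similarity input to an overlapping clustering algorithm.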