Kernel agglomerative hierarchical clustering and an application to textual data
Abstract
The Lance-Williams formula is a framework that unifies seven schemes of agglomerative
hierarchical clustering. In this paper, we establish a new expression of this formula using cosine
similarities instead of distances. We state conditions under which the new formula is equivalent
to the original one. The interest of our approach is twofold. Firstly, we can naturally extend
agglomerative hierarchical clustering techniques to kernel functions. Secondly, reasoning in
terms of similarities allows us to design thresholding strategies on proximity values. Accordingly,
we propose to sparsify the similarity matrix with the goal of making these clustering techniques
more efficient. We apply our approach to text clustering tasks. Our results show that sparsifying
the inner product matrix considerably decreases memory usage and shortens running time
while preserving clustering quality.
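
To make the thresholding idea concrete, here is a minimal sketch of sparsifying a cosine similarity matrix. It is an illustration under stated assumptions, not the procedure used in the paper: the function names, the toy vectors, and the threshold value 0.5 are all hypothetical, and the abstract does not specify how the threshold is chosen.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_u * norm_v)

def sparse_similarities(vectors, threshold):
    """Store only the pairs (i, j), i < j, whose similarity is >= threshold."""
    kept = {}
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine_similarity(vectors[i], vectors[j])
            if s >= threshold:
                kept[(i, j)] = s
    return kept

# Toy term-count vectors: four documents over three terms (hypothetical data).
docs = [
    [2.0, 1.0, 0.0],
    [4.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 3.0],
]

# The threshold 0.5 is an arbitrary illustrative value.
sparse = sparse_similarities(docs, threshold=0.5)
```

On this toy input, only two of the six document pairs are stored, namely (0, 1) and (2, 3). A clustering procedure working on such a structure would treat the missing pairs as having zero similarity, which is where the memory savings come from.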