Representative training sets for classification and the variability of empirical distributions
In EGC 2014, vol. RNTI-E-26, pp. 299-304
We propose a novel approach for estimating the size of the training sets needed to construct valid models in machine learning and data mining. We aim to provide a good representation of the underlying population without making any distributional assumptions. Our technique is based on computing the standard deviation of the χ²-statistics of a series of samples. When successive statistics are relatively close, we assume that the samples adequately represent the true underlying distribution of the population, and that models learned from these samples will behave almost as well as models learned on the entire population. We validate our results with experiments involving classifiers of various levels of complexity and learning capability.
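
The core computation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chi2_statistic` and `chi2_std`, the choice of the population's own empirical distribution as the reference, sampling without replacement, and the toy three-class population are all assumptions made for illustration.

```python
import random
import statistics
from collections import Counter


def chi2_statistic(sample, reference_probs):
    """Pearson chi-squared statistic of a sample against reference probabilities."""
    n = len(sample)
    observed = Counter(sample)
    return sum(
        (observed.get(value, 0) - n * p) ** 2 / (n * p)
        for value, p in reference_probs.items()
    )


def chi2_std(population, sample_size, num_samples=30, seed=0):
    """Standard deviation of chi-squared statistics over a series of samples.

    Assumption: samples are drawn without replacement and compared with
    the population's own empirical distribution.
    """
    rng = random.Random(seed)
    total = len(population)
    reference_probs = {v: c / total for v, c in Counter(population).items()}
    stats = [
        chi2_statistic(rng.sample(population, sample_size), reference_probs)
        for _ in range(num_samples)
    ]
    return statistics.pstdev(stats)


# Hypothetical categorical population with three class labels.
population = ["a"] * 500 + ["b"] * 300 + ["c"] * 200
for size in (50, 200, 800, 1000):
    print(size, round(chi2_std(population, size), 4))
```

In this sketch, a sample that covers the whole population reproduces the reference distribution exactly, so every statistic is zero and their standard deviation vanishes. How the spread behaves at intermediate sizes, and which threshold counts as "relatively close", depends on choices the abstract does not specify.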