Representative training sets for classification and the variability of empirical distributions
Abstract
We propose a novel approach for estimating the size of the training
sets needed to construct valid models in machine learning and data
mining. We aim to provide a good representation of the underlying population
without making any distributional assumptions.
Our technique is based on computing the standard deviation of the χ²-statistics
of a series of samples. When successive statistics are relatively close,
we assume that the samples produced adequately represent the true underlying
distribution of the population, and that models learned from these samples will
perform almost as well as models learned on the entire population.
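As a minimal sketch of this criterion, consider the following Python code. It assumes that the statistic is a χ² divergence computed on binned samples and that "relatively close" means a small standard deviation across repeated draws; the names find_representative_size and chi2_divergence, the quantile binning, and the threshold tol are illustrative assumptions, not definitions taken from this work.

    import numpy as np

    def chi2_divergence(sample, bin_edges, ref_probs):
        # Chi-square divergence between a sample's binned empirical
        # distribution and reference bin probabilities.
        counts, _ = np.histogram(sample, bins=bin_edges)
        p_hat = counts / counts.sum()
        return float(np.sum((p_hat - ref_probs) ** 2 / ref_probs))

    def find_representative_size(data, rng, sizes, n_draws=30, tol=2e-3, n_bins=10):
        # Smallest tested sample size whose chi-square statistics are
        # stable across repeated draws (standard deviation below tol).
        # Quantile-based bins keep every reference probability away from zero.
        bin_edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
        ref_counts, _ = np.histogram(data, bins=bin_edges)
        ref_probs = ref_counts / ref_counts.sum()
        for n in sizes:
            stats = [chi2_divergence(rng.choice(data, size=n, replace=False),
                                     bin_edges, ref_probs)
                     for _ in range(n_draws)]
            if np.std(stats) < tol:  # successive statistics are relatively close
                return n
        return None  # no tested size met the stability criterion

    rng = np.random.default_rng(0)
    population = rng.gamma(shape=2.0, scale=1.5, size=100_000)  # stand-in population
    print(find_representative_size(population, rng, sizes=[200, 1000, 5000, 20000]))

In this reading, the threshold tol plays the role of the closeness criterion and would need tuning to the data at hand; the same stability check could equally be applied per feature or to class-label frequencies.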
We validate our results through experiments involving classifiers of varying
complexity and learning capability.