On reservoir sampling with deletions
In MODULAD 2007, vol. Modulad 36, pp.14-17
Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we describe a recently proposed method for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. Our scheme, called "random pairing" (RP), maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions; RP reduces to the "passive" algorithm in [1] when the insertions and deletions correspond to a moving window over a data stream. We also prove that it is not possible to "resize" a bounded-size random sample upwards without accessing the base data.