Stratified reservoir sampling over heterogeneous data streams

  • Authors:
  • Mohammed Al-Kateb;Byung Suk Lee

  • Affiliations:
  • Department of Computer Science, The University of Vermont, Burlington, VT;Department of Computer Science, The University of Vermont, Burlington, VT

  • Venue:
  • SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Reservoir sampling is a well-known technique for random sampling over data streams. In many streaming applications, however, an input stream may be naturally heterogeneous, i.e., composed of sub-streams whose statistical properties may also vary considerably. For this class of applications, the conventional reservoir sampling technique does not guarantee a statistically sufficient number of tuples from each substream to be included in the reservoir, and this can cause a damage on the statistical quality of the sample. In this paper, we deal with this heterogeneity problem by stratifying the reservoir sample among the underlying sub-streams. We particularly consider situations in which the stratified reservoir sample is needed to obtain reliable estimates at the level of either the entire data stream or individual sub-streams. The first challenge in this stratification is to achieve an optimal allocation of a fixed-size reservoir to individual sub-streams. The second challenge is to adaptively adjust the allocation as sub-streams appear in, or disappear from, the input stream and as their statistical properties change over time. We present a stratified reservoir sampling algorithm designed to meet these challenges, and demonstrate through experiments the superior sample quality and the adaptivity of the algorithm.