The eternal sunshine of the sketch data structure

  • Authors:
  • Xenofontas Dimitropoulos;Marc Stoecklin;Paul Hurley;Andreas Kind

  • Affiliations:
  • IBM Zurich Research Laboratory, Saumerstrasse 4, CH-8803 Zurich, Ruschlikon, Switzerland;IBM Zurich Research Laboratory, Saumerstrasse 4, CH-8803 Zurich, Ruschlikon, Switzerland;IBM Zurich Research Laboratory, Saumerstrasse 4, CH-8803 Zurich, Ruschlikon, Switzerland;IBM Zurich Research Laboratory, Saumerstrasse 4, CH-8803 Zurich, Ruschlikon, Switzerland

  • Venue:
  • Computer Networks: The International Journal of Computer and Telecommunications Networking
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

In the past years there has been significant research on developing compact data structures for summarizing large data streams. A family of such data structures is the so-called sketches. Sketches bear similarities to the well-known Bloom filters [B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of ACM, 13 (7) (1970), 422-426] and employ hashing techniques to approximate the count associated with an arbitrary key in a data stream using fixed memory resources. One limitation of sketches is that when used for summarizing long data streams, they gradually saturate, resulting in a potentially large error on estimated key counts. In this work, we introduce two techniques to address this problem based on the observation that real-world data streams often have many transient keys that appear for short time periods and do not re-appear later on. After entering the data structure, these keys contribute to hashing collisions and thus reduce the estimation accuracy of sketches. Our techniques use a limited amount of additional memory to detect transient keys and to periodically remove their hashed values from the sketch. In this manner the number of keys hashed into a sketch decreases, and as a result the frequency of hashing collisions and the estimation error are reduced. Our first technique in effect slows down the saturation process of a sketch, whereas our second technique completely prevents a sketch from saturating. We demonstrate the performance improvements of our techniques analytically as well as experimentally. Our evaluation results using real network traffic traces show a reduction in the collision rate ranging between 26.1% and 98.2% and even higher savings in terms of estimation accuracy compared to a state-of-the-art sketch data structure. To our knowledge this is the first work to look into the problem of improving the accuracy of sketches by mitigating their saturation process.