In recent years there has been significant research on developing compact data structures for summarizing large data streams. One family of such data structures is known as sketches. Sketches bear similarities to the well-known Bloom filters [B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of the ACM, 13 (7) (1970), 422-426] and employ hashing techniques to approximate the count associated with an arbitrary key in a data stream using fixed memory resources. One limitation of sketches is that, when used to summarize long data streams, they gradually saturate, which can lead to a large error in the estimated key counts. In this work, we introduce two techniques that address this problem, based on the observation that real-world data streams often contain many transient keys that appear for short time periods and do not reappear later. Once they enter the data structure, these keys contribute to hashing collisions and thus reduce the estimation accuracy of sketches. Our techniques use a limited amount of additional memory to detect transient keys and to periodically remove their hashed values from the sketch. This reduces the number of keys hashed into the sketch and, as a result, the frequency of hashing collisions and the estimation error. Our first technique in effect slows down the saturation process of a sketch, whereas our second technique completely prevents a sketch from saturating. We demonstrate the performance improvements of our techniques both analytically and experimentally. Our evaluation using real network traffic traces shows a reduction in the collision rate of between 26.1% and 98.2%, and even larger improvements in estimation accuracy, compared to a state-of-the-art sketch data structure. To our knowledge, this is the first work to address the problem of improving the accuracy of sketches by mitigating their saturation process.
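The effect described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method. It assumes a count-min style sketch as the underlying structure, assumes the transient keys and their counts are already known (here they are tracked exactly in a plain dictionary, standing in for the "limited amount of additional memory"), and assumes removal is done by subtracting each transient key's count from its counters. The names `CountMinSketch`, `remove`, and `demo` are hypothetical.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory counter table: `depth` rows of `width` counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, key: str, count: int = 1) -> None:
        # Add `count` to one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.rows[row][self._index(key, row)]
                   for row in range(self.depth))

    def remove(self, key: str, count: int) -> None:
        # Assumed removal step: subtract a key's known contribution.
        self.update(key, -count)


def demo() -> tuple[int, int]:
    """Return the estimate for one long-lived key before and after cleanup."""
    sketch = CountMinSketch(width=16, depth=3)
    sketch.update("persistent-key", 100)

    # Many short-lived keys crowd the small sketch and cause collisions.
    # Assumption: their exact counts are kept in side memory.
    transient_counts: dict[str, int] = {}
    for i in range(500):
        key = f"transient-{i}"
        sketch.update(key, 1)
        transient_counts[key] = transient_counts.get(key, 0) + 1

    before = sketch.estimate("persistent-key")

    # Periodic cleanup: take the transient keys back out of the sketch.
    for key, count in transient_counts.items():
        sketch.remove(key, count)

    after = sketch.estimate("persistent-key")
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(f"estimate before cleanup: {before}, after cleanup: {after}")
```

Before cleanup the estimate for the long-lived key can only be at or above its true count of 100, because colliding transient keys add to the same counters. After cleanup only that key's contribution remains, so the estimate returns to the true count. How transient keys are detected within a bounded memory budget is the substance of the paper and is deliberately left out of this sketch.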