How to scalably and accurately skip past streams

Authors:
Supratik Bhattacharyya;Andre Madeira;S. Muthukrishnan;Tao Ye
Affiliations:
Sprint ATL. supratik@gmail.com;Rutgers University. amadeira@cs.rutgers.edu;Rutgers University. muthu@cs.rutgers.edu;Sprint ATL. tao.ye@sprint.com
Venue:
ICDEW '07 Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop
Year:
2007

Abstract

Data stream methods examine each new item of the stream, perform a small number of operations while keeping a small amount of memory, and still support much-needed analyses. However, in many situations the per-item update speed is critical, and not every item can be examined extensively. In practice, this has been addressed by examining only every Nth item of the input, which reduces the input rate to a fraction 1/N of the original but sacrifices the accuracy guarantees of the post-hoc analyses. In this paper, we present a technique for skipping past streams and looking at only a fraction of the input. Unlike traditional methods, our skipping is performed in a principled manner, based on the "norm" of the stream seen so far. Applying this technique on top of well-known sketches, we show a several-fold improvement in update time for processing streams with a given guaranteed accuracy, for a number of stream processing problems including data summarization, heavy-hitter detection, and self-join size estimation. We present experimental results for our methods on synthetic data and integrate them into Sprint's Continuous Monitoring (CMON) system for live network traffic analysis. Furthermore, aiming at future scalable stream processing systems and going beyond state-of-the-art packet header analysis, we show how packet contents can be analyzed at streaming speeds, a more challenging task because the content of a single packet can result in many updates.
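
To make the baseline criticized in the abstract concrete, the following is a minimal sketch of "examine every Nth item" sampling feeding a Count-Min sketch. It is not the paper's norm-based skipping technique, which the abstract does not specify. The class `CountMinSketch`, the function `sketch_every_nth`, the parameters `width` and `depth`, and the choice to scale each sampled update by N are all assumptions made for illustration.

```python
import random


class CountMinSketch:
    """Minimal Count-Min sketch (illustrative, not the paper's code)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random salt per row; hashing (salt, item) acts as a per-row hash.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, salt, item):
        return hash((salt, item)) % self.width

    def update(self, item, count=1):
        # Add `count` to one counter in every row.
        for salt, row in zip(self.salts, self.rows):
            row[self._index(salt, item)] += count

    def estimate(self, item):
        # The minimum over rows never underestimates what was inserted.
        return min(
            row[self._index(salt, item)]
            for salt, row in zip(self.salts, self.rows)
        )


def sketch_every_nth(stream, n, width=256, depth=4):
    """Baseline: look only at every n-th item and weight it by n (assumption)."""
    sketch = CountMinSketch(width, depth)
    for position, item in enumerate(stream):
        if position % n == 0:
            sketch.update(item, n)
    return sketch


if __name__ == "__main__":
    # A periodic stream aligned with the sampling period: item 2 is never seen.
    periodic = [1, 2] * 50
    sampled = sketch_every_nth(periodic, n=2)
    print("estimate for item 1:", sampled.estimate(1), "(true count 50)")
```

With a stream whose period lines up with N, as in the usage example, the sampled sketch attributes all of the weight to one item, which illustrates how fixed-interval sampling can lose the accuracy guarantee the full sketch would provide.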