Finding hierarchical heavy hitters in streaming data

Authors:
Graham Cormode;Flip Korn;S. Muthukrishnan;Divesh Srivastava
Affiliations:
AT&T Labs--Research, Florham Park, NJ;AT&T Labs--Research, Florham Park, NJ;Rutgers University, Piscataway, NJ;AT&T Labs--Research, Florham Park, NJ
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2008

Abstract

Data items that arrive online as streams typically have attributes that take values from one or more hierarchies (time and geographic location, source and destination IP addresses, etc.). Providing an aggregate view of such data is important for summarization, visualization, and analysis. We develop an aggregate view based on certain organized sets of large-valued regions (“heavy hitters”) corresponding to hierarchically discounted frequency counts. We formally define the notion of hierarchical heavy hitters (HHHs). We first consider computing (approximate) HHHs over a data stream drawn from a single hierarchical attribute. We formalize the problem and give deterministic algorithms to find them in a single pass over the input. In order to analyze a wider range of realistic data streams (e.g., from IP traffic-monitoring applications), we generalize this problem to multiple dimensions. Here, the semantics of HHHs are more complex, since a “child” node can have multiple “parent” nodes. We present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. The product of hierarchical dimensions forms a mathematical lattice structure. Our algorithms exploit this structure, and so are able to track approximate HHHs using only a small, fixed number of statistics per stored item, regardless of the number of dimensions. We show experimentally, using real data, that our proposed algorithms yield outputs that are very similar (virtually identical, in many cases) to offline computations of the exact solutions, whereas straightforward heavy-hitters-based approaches give significantly inferior answer quality. Furthermore, the proposed algorithms achieve order-of-magnitude savings in data structure size while performing competitively.
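
To make the idea of "hierarchically discounted frequency counts" concrete, the following is a minimal sketch of an exact, offline computation for a single hierarchical attribute. It is not the streaming algorithm from the paper. It assumes one common reading of the definition: a node is an HHH if its count, after excluding items already covered by HHH descendants, reaches a fraction `phi` of the stream length. The function name `exact_hhh`, the tuple encoding of hierarchy paths, and the sample data are illustrative assumptions.

```python
from collections import Counter


def exact_hhh(items, phi):
    """Exact offline HHHs for one hierarchy (illustrative sketch only).

    items: list of equal-length tuples, each a root-to-leaf path,
           e.g. the four octets of an IPv4 address (an assumed encoding).
    phi:   threshold fraction of the total stream length.
    Returns a dict mapping each HHH prefix to its discounted count.
    """
    threshold = phi * len(items)
    depth = len(items[0]) if items else 0
    level = Counter(items)  # counts at the deepest level
    hhh = {}
    # Walk bottom-up, from full paths to the empty prefix (the root).
    for length in range(depth, -1, -1):
        parent = Counter()
        for node, count in level.items():
            if count >= threshold:
                # Heavy after discounting: report it, pass nothing upward.
                hhh[node] = count
            elif length > 0:
                # Not heavy: its count is charged to the parent prefix.
                parent[node[:-1]] += count
        level = parent
    return hhh


# Hypothetical stream of 100 addresses, split into octets.
stream = (
    [("10", "0", "0", "1")] * 30
    + [("10", "0", "0", "2")] * 8
    + [("10", "0", "0", "3")] * 8
    + [("10", "0", "1", "1")] * 4
    + [("192", "168", str(i), "1") for i in range(50)]
)
result = exact_hhh(stream, phi=0.1)
for prefix, count in sorted(result.items()):
    print(".".join(prefix), count)
```

In this sample, the prefix 10.0.0 is reported with a discounted count of 16 rather than its raw count of 46, because the 30 occurrences of 10.0.0.1 are already accounted for by a heavy descendant. The prefix 192.168 is reported even though none of its individual addresses is frequent, which is the kind of aggregate a flat heavy-hitters computation would miss.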