Methods for finding frequent items in data streams

Authors:
Graham Cormode; Marios Hadjieleftheriou
Affiliations:
AT&T Labs-Research, Florham Park, USA; AT&T Labs-Research, Florham Park, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2010

Abstract

The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
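To make the problem statement concrete, here is a minimal sketch of one classic counter-based approach, the Misra-Gries summary from the 1980s. It is chosen here only as an illustration and is not taken from the paper's baseline implementations; the function name `misra_gries` and the parameter `k` are assumptions of this sketch. With at most `k - 1` counters, every item occurring more than `n / k` times in a stream of length `n` is guaranteed to survive in the summary, and each stored count underestimates the true count by at most `n / k`.

```python
def misra_gries(stream, k):
    """Summarize a stream using at most k - 1 counters.

    Illustrative sketch only: any item whose true frequency exceeds
    len(stream) / k is guaranteed to appear in the returned dict.
    Items below that threshold may also appear (false positives).
    """
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: increment its counter.
            counters[item] += 1
        elif len(counters) < k - 1:
            # A counter is free: start tracking this item.
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop
            # those that reach zero. The new item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # 'a' occurs 6 times out of 12, which is more than 12 / 3.
    example = list("abacabadaeab")
    print(misra_gries(example, k=3))
```

Because the summary can contain false positives, a second pass over the data (when one is possible) can verify the exact counts of the surviving candidates.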