What's hot and what's not: tracking most frequent items dynamically

Authors:
Graham Cormode;S. Muthukrishnan
Affiliations:
Rutgers University, Murray Hill, NJ;Rutgers University, Piscataway, NJ
Venue:
ACM Transactions on Database Systems (TODS) - Special Issue: SIGMOD/PODS 2003
Year:
2005

Abstract

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications.

We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small space data structures that monitor the transactions on the relation, and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing.” They are simple to implement, and have provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
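The group-testing idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' exact algorithm, only one plausible reading of it: items are hashed into groups, and each group keeps a total count plus one count per bit of the item identifier, so that an item holding a majority of its group can be read back bit by bit. Because every counter is updated by plain addition, a deletion is just an update with a negative delta. The class name `HotItemSketch`, the parameters `groups`, `reps`, and `bits`, the linear hash, and the restriction to non-negative integer identifiers are all assumptions made for illustration.

```python
import random


class HotItemSketch:
    """Illustrative sketch: hashed groups with per-bit counters (assumed design)."""

    def __init__(self, groups=64, reps=4, bits=32, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.bits = bits
        self.prime = (1 << 61) - 1
        # One (a, b) pair per repetition for a simple linear hash (assumption).
        self.hashes = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(reps)]
        # counts[r][g][0] is the group total; counts[r][g][j + 1] counts
        # the items in that group whose identifier has bit j set.
        self.counts = [[[0] * (bits + 1) for _ in range(groups)]
                       for _ in range(reps)]
        self.total = 0

    def _group(self, r, item):
        a, b = self.hashes[r]
        return ((a * item + b) % self.prime) % self.groups

    def update(self, item, delta):
        """Apply an insert (delta=+1) or a delete (delta=-1) of `item`."""
        self.total += delta
        for r in range(len(self.hashes)):
            cell = self.counts[r][self._group(r, item)]
            cell[0] += delta
            for j in range(self.bits):
                if (item >> j) & 1:
                    cell[j + 1] += delta

    def estimate(self, item):
        """Upper bound on the count of `item`, assuming no count goes negative."""
        return min(self.counts[r][self._group(r, item)][0]
                   for r in range(len(self.hashes)))

    def hot_items(self, k):
        """Return candidates whose count may exceed total / (k + 1)."""
        threshold = self.total / (k + 1)
        found = set()
        for r in range(len(self.hashes)):
            for cell in self.counts[r]:
                if cell[0] <= threshold:
                    continue  # no item in this group can be hot
                # Bit j is set when more than half of the group's weight has it.
                item = sum(1 << j for j in range(self.bits)
                           if 2 * cell[j + 1] > cell[0])
                # Keep the candidate only if every repetition agrees it may be hot.
                if self.estimate(item) > threshold:
                    found.add(item)
        return found


sketch = HotItemSketch()
for _ in range(100):
    sketch.update(7, +1)      # item 7 starts out frequent
for _ in range(60):
    sketch.update(42, +1)
for item in range(1000, 1050):
    sketch.update(item, +1)   # background items, one occurrence each
for _ in range(90):
    sketch.update(7, -1)      # deletions cool item 7 down
print(sketch.hot_items(k=3))
```

In this sketch a hot item is recovered only if it holds a majority of its group in at least one repetition, so the result is probabilistic and may include false positives; more groups or repetitions trade space for accuracy. No relation is rescanned: queries read only the counters.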