Efficient computation of frequent and top-k elements in data streams

Authors:
Ahmed Metwally;Divyakant Agrawal;Amr El Abbadi
Affiliations:
Department of Computer Science, University of California, Santa Barbara;Department of Computer Science, University of California, Santa Barbara;Department of Computer Science, University of California, Santa Barbara
Venue:
ICDT'05 Proceedings of the 10th international conference on Database Theory
Year:
2005

Citing 10

Analysis of Hoare's FIND algorithm with median-of-three partition

Random Structures & Algorithms - Special issue: average-case analysis of algorithms
Algorithm 65: find

Communications of the ACM
Finding Frequent Items in Data Streams

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Frequency Estimation of Internet Packet Streams with Limited Space

ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
A simple algorithm for finding frequent elements in streams and bags

ACM Transactions on Database Systems (TODS)
What's hot and what's not: tracking most frequent items dynamically

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice

ACM Transactions on Computer Systems (TOCS)
Dynamically maintaining frequent items over a data stream

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Finding hierarchical heavy hitters in data streams

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Cited 56

Duplicate detection in click streams

WWW '05 Proceedings of the 14th international conference on World Wide Web
Using association rules for fraud detection in web advertising networks

VLDB '05 Proceedings of the 31st international conference on Very large data bases
An integrated efficient solution for computing frequent and top-k elements in data streams

ACM Transactions on Database Systems (TODS)
Data streams: algorithms and applications

Foundations and Trends® in Theoretical Computer Science
Finding hierarchical heavy hitters in network measurement system

Proceedings of the 2007 ACM symposium on Applied computing
Fast data stream algorithms using associative memories

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Sketching probabilistic data streams

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Quality-Aware Sampling and Its Applications in Incremental Data Mining

IEEE Transactions on Knowledge and Data Engineering
Finding hierarchical heavy hitters in streaming data

ACM Transactions on Knowledge Discovery from Data (TKDD)
Probabilistic lossy counting: an efficient algorithm for finding heavy hitters

ACM SIGCOMM Computer Communication Review
Extracting k most important groups from data efficiently

Data & Knowledge Engineering
Power-law relationship and self-similarity in the itemset support distribution: analysis and applications

The VLDB Journal — The International Journal on Very Large Data Bases
FIDS: Monitoring Frequent Items over Distributed Data Streams

MLDM '07 Proceedings of the 5th international conference on Machine Learning and Data Mining in Pattern Recognition
DELAY: A Lazy Approach for Mining Frequent Patterns over High Speed Data Streams

ADMA '07 Proceedings of the 3rd international conference on Advanced Data Mining and Applications
Separator: Sifting Hierarchical Heavy Hitters Accurately from Data Streams

ADMA '07 Proceedings of the 3rd international conference on Advanced Data Mining and Applications
Clustering Distributed Sensor Data Streams

ECML PKDD '08 Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part II
Finding frequent items in data streams

Proceedings of the VLDB Endowment
Mining top-k Hot Melody Structures over online music query streams

Pattern Recognition Letters
Feature-preserved sampling over streaming data

ACM Transactions on Knowledge Discovery from Data (TKDD)
A sliding window method for finding top-k path traversal patterns over streaming Web click-sequences

Expert Systems with Applications: An International Journal
CLIC: client-informed caching for storage servers

FAST '09 Proceedings of the 7th conference on File and storage technologies
HIDS: a multifunctional generator of hierarchical data streams

ACM SIGMIS Database
Interactive mining of top-K frequent closed itemsets from data streams

Expert Systems with Applications: An International Journal
Mining top-k maximal reference sequences from streaming web click-sequences with a damped sliding window

Expert Systems with Applications: An International Journal
Measuring evolving data streams' behavior through their intrinsic dimension

New Generation Computing
Space-optimal heavy hitters with strong error bounds

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Estimating the confidence of conditional functional dependencies

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Clustering over Evolving Data Streams Based on Online Recent-Biased Approximation

Knowledge Acquisition: Approaches, Algorithms and Applications
Finding the frequent items in streams of data

Communications of the ACM - A View of Parallel Computing
An evaluation study of clustering algorithms in the scope of user communities assessment

Computers & Mathematics with Applications
Methods for finding frequent items in data streams

The VLDB Journal — The International Journal on Very Large Data Bases
Discovering correlated items in data streams

PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
Event-based lossy compression for effective and efficient OLAP over data streams

Data & Knowledge Engineering
Aggregate computation over data streams

APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
Mining top-K frequent itemsets through progressive sampling

Data Mining and Knowledge Discovery
Mining discriminative items in multiple data streams

World Wide Web
Space-optimal heavy hitters with strong error bounds

ACM Transactions on Database Systems (TODS)
TOPSIL-Miner: an efficient algorithm for mining top-K significant itemsets over data streams

Knowledge and Information Systems
Private and continual release of statistics

ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming: Part II
Lightweight problem determination in DBMSs using data stream analysis techniques

Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
Clustering distributed sensor data streams using local processing and reduced communication

Intelligent Data Analysis - Ubiquitous Knowledge Discovery
A practical approach to portscan detection in very high-speed links

PAM'11 Proceedings of the 12th international conference on Passive and active measurement
Mining frequent itemsets over distributed data streams by continuously maintaining a global synopsis

Data Mining and Knowledge Discovery
Space-efficient tracking of persistent items in a massive data stream

Proceedings of the 5th ACM international conference on Distributed event-based system
Private and Continual Release of Statistics

ACM Transactions on Information and System Security (TISSEC)
Mining top-k regular-frequent itemsets using database partitioning and support estimation

Expert Systems with Applications: An International Journal
MOA-TweetReader: real-time analysis in Twitter streaming data

DS'11 Proceedings of the 14th international conference on Discovery science
Discovering trending phrases on information streams

Proceedings of the 20th ACM international conference on Information and knowledge management
Error-adaptive and time-aware maintenance of frequency counts over data streams

WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management
Tracking distributed aggregates over time-based sliding windows

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

ACM Transactions on Database Systems (TODS)
Zips: mining compressing sequential patterns in streams

Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics
A thin monitoring layer for top-k aggregation queries over a database

Proceedings of the 7th International Workshop on Ranking in Databases
Automated signature extraction for high volume attacks

ANCS '13 Proceedings of the ninth ACM/IEEE symposium on Architectures for networking and communications systems
FaRNet: Fast recognition of high-dimensional patterns from big network traffic data

Computer Networks: The International Journal of Computer and Telecommunications Networking
Mining top-k frequent patterns over data streams sliding window

Journal of Intelligent Information Systems

Quantified Score

Hi-index	0.01

Abstract

We propose an integrated approach to two problems: finding the k most popular elements, and finding the frequent elements, in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large-alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general data distributions, our top-k algorithm can return a set of k′ elements, where k′ ≈ k, which are guaranteed to be the top-k′ elements; and we use minimal space for calculating frequent elements. For realistic Zipfian data, our space requirement for the frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, we ensure that only the top-k elements, in the correct order, are reported. Our experiments show significant space reductions with no loss in accuracy.
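
The abstract does not spell out the mechanism, so the following is only a minimal sketch of a counter-based stream summary in the style of the Space-Saving algorithm usually associated with this paper. It is not the authors' implementation: the function names `space_saving` and `frequent`, the parameter `m` (number of counters), and the dictionary layout are assumptions made for illustration. The linear scan for the minimum counter is a simplification, not an efficient data structure.

```python
def space_saving(stream, m):
    """Summarize a stream with at most m counters (illustrative sketch).

    Returns {element: (estimated_count, max_overestimation)}.
    """
    counts = {}  # element -> estimated count (never below the true count)
    errors = {}  # element -> upper bound on how much the count overestimates
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < m:
            counts[x] = 1
            errors[x] = 0
        else:
            # Replace the element with the smallest estimated count;
            # the newcomer inherits that count as its possible error.
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[x] = floor + 1
            errors[x] = floor
    return {e: (counts[e], errors[e]) for e in counts}


def frequent(summary, n, phi):
    """Candidates whose estimated count exceeds phi * n.

    Maps each candidate to True when count - error also exceeds phi * n,
    i.e. when the element is guaranteed to be frequent.
    """
    threshold = phi * n
    return {e: (c - err > threshold)
            for e, (c, err) in summary.items() if c > threshold}


if __name__ == "__main__":
    data = list("aaaabbbc")
    summary = space_saving(data, m=2)
    print(summary)
    print(frequent(summary, n=len(data), phi=0.4))
```

When `m` is at least the number of distinct elements (the small-alphabet case), every count in this sketch is exact and every error is zero. With fewer counters, the estimated counts still sum to the stream length, and `count - error` gives a lower bound on each monitored element's true frequency.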