An integrated efficient solution for computing frequent and top-k elements in data streams

Authors:
Ahmed Metwally;Divyakant Agrawal;Amr El Abbadi
Affiliations:
University of California, Santa Barbara, Santa Barbara, CA;University of California, Santa Barbara, Santa Barbara, CA;University of California, Santa Barbara, Santa Barbara, CA
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2006

Citing 39
Cited 36

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences
A linear-time probabilistic counting algorithm for database applications

ACM Transactions on Database Systems (TODS)
The space complexity of approximating the frequency moments

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Analysis of Hoare's FIND algorithm with median-of-three partition

Random Structures & Algorithms - Special issue: average-case analysis of algorithms
New sampling-based summary statistics for improving approximate query answers

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Random sampling techniques for space efficient online computation of order statistics of large datasets

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Packet classification on multiple fields

Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
NiagaraCQ: a scalable continuous query system for Internet databases

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Hancock: a language for extracting signatures from data streams

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
Algorithm 65: find

Communications of the ACM
On computing correlated aggregates over continual data streams

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Space-efficient online computation of quantile summaries

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Data-streams and histograms

STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Models and issues in data stream systems

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Maintaining stream statistics over sliding windows: (extended abstract)

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Computing Iceberg Queries Efficiently

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Dynamic Maintenance of Wavelet-Based Histograms

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries

Proceedings of the 27th International Conference on Very Large Data Bases
Sampling-Based Estimation of the Number of Distinct Values of an Attribute

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Finding Frequent Items in Data Streams

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Histogramming Data Streams with Fast Per-Item Processing

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Frequency Estimation of Internet Packet Streams with Limited Space

ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
A simple algorithm for finding frequent elements in streams and bags

ACM Transactions on Database Systems (TODS)
What's hot and what's not: tracking most frequent items dynamically

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Issues in data stream management

ACM SIGMOD Record
An Approximate L1-Difference Algorithm for Massive Data Streams

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice

ACM Transactions on Computer Systems (TOCS)
Distributed top-k monitoring

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Identifying frequent items in sliding windows over on-line packet streams

Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement
A Web page prediction model based on click-stream tree representation of user behavior

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Dynamically maintaining frequent items over a data stream

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Diamond in the rough: finding Hierarchical Heavy Hitters in multi-dimensional data

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Duplicate detection in click streams

WWW '05 Proceedings of the 14th international conference on World Wide Web
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
StatStream: statistical monitoring of thousands of data streams in real time

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Finding hierarchical heavy hitters in data streams

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Efficient computation of frequent and top-k elements in data streams

ICDT'05 Proceedings of the 10th international conference on Database Theory

Best position algorithms for top-k queries

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Finding frequent items in probabilistic data

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SLEUTH: Single-pubLisher attack dEtection Using correlaTion Hunting

Proceedings of the VLDB Endowment
CAM conscious integrated answering of frequent elements and top-k queries over data streams

Proceedings of the 4th international workshop on Data management on new hardware
Efficient Single-Pass Mining of Weighted Interesting Patterns

AI '08 Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
Frequent items in streaming data: An experimental evaluation of the state-of-the-art

Data & Knowledge Engineering
Handling Dynamic Weights in Weighted Frequent Pattern Mining

IEICE - Transactions on Information and Systems
Optimal tracking of distributed heavy hitters and quantiles

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Small synopses for group-by query verification on outsourced data streams

ACM Transactions on Database Systems (TODS)
Online Evaluation of Patterns from Evolving Web Data Streams

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Space-economical partial gram indices for exact substring matching

Proceedings of the 18th ACM conference on Information and knowledge management
Thread cooperation in multicore architectures for frequency counting over multiple data streams

Proceedings of the VLDB Endowment
Incorporating prediction models in the SelfLet framework: a plugin approach

Proceedings of the Fourth International ICST Conference on Performance Evaluation Methodologies and Tools
Comments on “an integrated efficient solution for computing frequent and top-k elements in data streams”

ACM Transactions on Database Systems (TODS)
An online framework for catching top spreaders and scanners

Computer Networks: The International Journal of Computer and Telecommunications Networking
Optimal sampling from distributed streams

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Parallelizing weighted frequency counting in high-speed network monitoring

Computer Communications
Best position algorithms for efficient top-k query processing

Information Systems
Beyond simple aggregates: indexing for summary queries

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Mining hot calling contexts in small space

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Data-driven modeling and analysis of online social networks

WAIM'11 Proceedings of the 12th international conference on Web-age information management
Lower bounds for number-in-hand multiparty communication complexity, made easy

Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
Continuous sampling from distributed streams

Journal of the ACM (JACM)
Single-pass incremental and interactive mining for weighted frequent patterns

Expert Systems with Applications: An International Journal
Mergeable summaries

PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Randomized algorithms for tracking distributed count, frequencies, and ranks

PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Survey: Streaming techniques and data aggregation in networks of tiny artefacts

Computer Science Review
Interactive mining of high utility patterns over data streams

Expert Systems with Applications: An International Journal
Efficient frequent item counting in multi-core hardware

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Improved counter based algorithms for frequent pairs mining in transactional data streams

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Scalable identification and measurement of heavy-hitters

Computer Communications
High throughput heavy hitter aggregation for modern SIMD processors

Proceedings of the Ninth International Workshop on Data Management on New Hardware
Mergeable summaries

ACM Transactions on Database Systems (TODS) - Invited papers issue
Indexing for summary queries: Theory and practice

ACM Transactions on Database Systems (TODS)
Identifying streaming frequent items in ad hoc time windows

Data & Knowledge Engineering
Accelerating frequent item counting with FPGA

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Quantified Score

Hi-index	0.00

Visualization

Abstract

We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For general data distributions, our top-k algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, the analysis ensures that only the top-k elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with hardly any loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and top-k elements. Although the problems of incremental reporting of frequent and top-k elements are useful in many applications, to the best of our knowledge, no solution has been proposed.