The frequent items problem is to process a stream of items and find all items that occur more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms and used them to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
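To make the problem statement concrete, the following is a minimal sketch of one well-known counter-based approach, the Misra-Gries summary. It is an illustration only and is not the baseline implementation described in the abstract. The names `misra_gries` and `frequent_items`, the parameter `phi` (the frequency fraction), and the second verification pass over a stored list are assumptions made for this example; a true one-pass setting would report the candidates without exact counts.

```python
import math
from collections import Counter


def misra_gries(stream, k):
    """One pass over `stream`, keeping at most k - 1 counters.

    Any item that occurs more than n / k times (n = stream length) is
    guaranteed to remain in the returned dict; the stored values are
    lower bounds on the true counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequent_items(items, phi):
    """Return {item: exact_count} for items occurring more than phi * n times.

    Hypothetical helper: it assumes `items` is a list that can be scanned
    twice, so the candidates from the summary can be checked exactly.
    """
    n = len(items)
    k = math.ceil(1 / phi)  # ensures n / k <= phi * n
    candidates = misra_gries(items, k)
    exact = Counter(x for x in items if x in candidates)
    return {x: c for x, c in exact.items() if c > phi * n}


stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a", "e", "a", "b"]
print(frequent_items(stream, 0.4))
```

With `phi = 0.4` the summary uses only two counters regardless of how many distinct items appear, which is the kind of small, fixed memory footprint the abstract refers to.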