We present a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. For several natural distributions on the item frequencies, our algorithm achieves better space bounds than the best previously known algorithms for this problem. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature.
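The abstract does not spell out the data structure, so the following is only a minimal Python sketch of how a count-sketch-style estimator is commonly built, not the authors' implementation. It assumes a small table of counters with `depth` rows and `width` columns, where each row hashes an item to one bucket and to a random sign, and the frequency estimate is the median of the signed counters across rows. The class name `CountSketch`, its parameters, and the use of salted `hashlib.blake2b` digests as a stand-in for the hash families a formal analysis would require are all assumptions made for illustration.

```python
import hashlib
import statistics


class CountSketch:
    """Illustrative count-sketch-style frequency estimator (assumed layout)."""

    def __init__(self, depth=5, width=256, seed=0):
        self.depth = depth  # number of independent rows (odd keeps the median simple)
        self.width = width  # number of counters per row
        self.seed = seed
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, row, item):
        # Assumption: a salted BLAKE2b digest stands in for the per-row
        # bucket hash and sign hash. The salt is 16 bytes (seed + row).
        salt = self.seed.to_bytes(8, "little") + row.to_bytes(8, "little")
        digest = hashlib.blake2b(
            repr(item).encode(), digest_size=8, salt=salt
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1          # lowest bit picks the sign
        bucket = (value >> 1) % self.width     # remaining bits pick the bucket
        return bucket, sign

    def update(self, item, count=1):
        """Add `count` occurrences of `item` (count may be negative)."""
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        """Return the median over rows of the signed counter for `item`."""
        readings = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, item)
            readings.append(sign * self.table[row][bucket])
        return statistics.median(readings)


if __name__ == "__main__":
    sketch = CountSketch()
    for word in ["a"] * 100 + ["b"] * 10 + ["c"]:
        sketch.update(word)
    print(sketch.estimate("a"), sketch.estimate("b"), sketch.estimate("c"))
```

Because every update is a signed addition, a structure like this is linear: feeding one stream with `count=1` and a second stream with `count=-1` leaves a sketch of the frequency differences, which is one plausible way to read the 2-pass change-detection claim above. A full top-k procedure would also keep a small set of candidate items alongside the sketch; that part is omitted here.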