Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences
Physical database design for relational databases
ACM Transactions on Database Systems (TODS)
Surpassing the information theoretic bound with fusion trees
Journal of Computer and System Sciences - Special issue: papers from the 22nd ACM symposium on the theory of computing, May 14–16, 1990
Randomized algorithms
Size-estimation framework with applications to transitive closure and reachability
Journal of Computer and System Sciences
Balls and bins: a study in negative dependence
Random Structures & Algorithms
The Aqua approximate query answering system
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
The space complexity of approximating the frequency moments
Journal of Computer and System Sciences
Estimating simple functions on the union of data streams
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Design of Dynamic Data Structures
Design of Dynamic Data Structures
Reductions in streaming algorithms, with an application to counting triangles in graphs
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Mining database structure; or, how to build a data quality browser
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Access path selection in a relational database management system
SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases
Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies
VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Counting Distinct Elements in a Data Stream
RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
Comparing Data Streams Using Hamming Norms (How to Zero In)
IEEE Transactions on Knowledge and Data Engineering
Multi-dimensional clustering: a new data layout scheme in DB2
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Tight Lower Bounds for the Distinct Elements Problem
FOCS '03 Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
On Universal Classes of Extremely Random Constant-Time Hash Functions
SIAM Journal on Computing
Optimal space lower bounds for all frequency moments
SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Algorithms for dynamic geometric problems over data streams
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Bitmap algorithms for counting active flows on high-speed links
IEEE/ACM Transactions on Networking (TON)
Counting distinct items over update streams
Theoretical Computer Science
On synopses for distinct-value estimation under multiset operations
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Compact dictionaries for variable-length keys and data with applications
ACM Transactions on Algorithms (TALG)
Uniform Hashing in Constant Time and Optimal Space
SIAM Journal on Computing
A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences
CCC '09 Proceedings of the 2009 24th Annual IEEE Conference on Computational Complexity
On the exact space complexity of sketching and streaming small norms
SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
Toward automated large-scale information integration and discovery
Data Management in a Connected World
Proceedings of the forty-second ACM symposium on Theory of computing
APPROX/RANDOM'10 Proceedings of the 13th international conference on Approximation, and 14 the International conference on Randomization, and combinatorial optimization: algorithms and techniques
Exponential time improvement for min-wise based algorithms
Information and Computation
Tight bounds for Lp samplers, finding duplicates in streams, and related problems
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Pan-private algorithms via statistics on sketches
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Finding heavy distinct hitters in data streams
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
An optimal lower bound on the communication complexity of gap-hamming-distance
Proceedings of the forty-third annual ACM symposium on Theory of computing
Proceedings of the forty-third annual ACM symposium on Theory of computing
Space-efficient tracking of persistent items in a massive data stream
Proceedings of the 5th ACM international conference on Distributed event-based system
Periodicity and cyclic shifts via linear sketches
APPROX'11/RANDOM'11 Proceedings of the 14th international workshop and 15th international conference on Approximation, randomization, and combinatorial optimization: algorithms and techniques
Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with sub-constant error
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
Exponential time improvement for min-wise based algorithms
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
On power-law distributed balls in bins and its applications to view size estimation
ISAAC'11 Proceedings of the 22nd international conference on Algorithms and Computation
Graph sketches: sparsification, spanners, and subgraphs
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Tight bounds for distributed functional monitoring
STOC '12 Proceedings of the forty-fourth annual ACM symposium on Theory of computing
Reordering rows for better compression: Beyond the lexicographic order
ACM Transactions on Database Systems (TODS)
SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce
ACM Transactions on Database Systems (TODS)
Proceedings of the 16th International Conference on Extending Database Technology
Optimal Bounds for Johnson-Lindenstrauss Transforms and Streaming Problems with Subconstant Error
ACM Transactions on Algorithms (TALG) - Special Issue on SODA'11
Homomorphic fingerprints under misalignments: sketching edit and shift distances
Proceedings of the forty-fifth annual ACM symposium on Theory of computing
Understanding RFID counting protocols
Proceedings of the 19th annual international conference on Mobile computing & networking
Arthur-Merlin streaming complexity
ICALP'13 Proceedings of the 40th international conference on Automata, Languages, and Programming - Volume Part I
Sketching for big data recommender systems using fast pseudo-random fingerprints
ICALP'13 Proceedings of the 40th international conference on Automata, Languages, and Programming - Volume Part II
Efficient sampling of non-strict turnstile data streams
FCT'13 Proceedings of the 19th international conference on Fundamentals of Computation Theory
Estimating duplication by content-based sampling
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Hi-index | 0.00 |
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1,...,n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(1/ε-2 + log(n)) bits of space with 2/3 success probability, where 0O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously. We also give an algorithm to estimate the Hamming norm of a stream, a generalization of the number of distinct elements, which is useful in data cleaning, packet tracing, and database auditing. Our algorithm uses nearly optimal space, and has optimal O(1) update and reporting times.