Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic

Authors:
Ahmed Metwally;Divyakant Agrawal;Amr El Abbadi
Affiliations:
Ask.com, Oakland, CA;Ask.com, Oakland, CA;University of California at Santa Barbara, Santa Barbara, CA
Venue:
EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Year:
2008

Citing 36
Cited 6

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences
Approximating the number of unique values of an attribute without sorting

Information Systems
Processing aggregate relational queries with hard time constraints

SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
A linear-time probabilistic counting algorithm for database applications

ACM Transactions on Database Systems (TODS)
Error-constrained COUNT query evaluation in relational databases

SIGMOD '91 Proceedings of the 1991 ACM SIGMOD international conference on Management of data
Implementing data cubes efficiently

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
The space complexity of approximating the frequency moments

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Bottom-up computation of sparse and Iceberg CUBE

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Statistical estimators for relational algebra expressions

Proceedings of the seventh ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Towards estimation error guarantees for distinct values

PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Estimating simple functions on the union of data streams

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Reductions in streaming algorithms, with an application to counting triangles in graphs

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Maintaining stream statistics over sliding windows: (extended abstract)

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Dwarf: shrinking the PetaCube

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Distributed streams algorithms for sliding windows

Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Data Mining and Knowledge Discovery
On Estimating the Size of Projections

ICDT '90 Proceedings of the Third International Conference on Database Theory
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports

Proceedings of the 27th International Conference on Very Large Data Bases
Sampling-Based Estimation of the Number of Distinct Values of an Attribute

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Counting Distinct Elements in a Data Stream

RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
QC-trees: an efficient summary structure for semantic OLAP

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Processing set expressions over continuous update streams

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Bitmap algorithms for counting active flows on high speed links

Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement
Range CUBE: Efficient Cube Computation by Exploiting Data Correlation

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Optimal space lower bounds for all frequency moments

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Diamond in the rough: finding Hierarchical Heavy Hitters in multi-dimensional data

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams

Distributed and Parallel Databases
CURE for cubes: cubing using a ROLAP engine

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
On Hit Inflation Techniques and Detection in Streams of Web Advertising Networks

ICDCS '07 Proceedings of the 27th International Conference on Distributed Computing Systems
Comparing data streams using Hamming norms (how to zero in)

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Quotient cube: how to summarize the semantics of a data cube

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
High-dimensional OLAP: a minimal cubing approach

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Counting distinct items over update streams

ISAAC'05 Proceedings of the 16th international conference on Algorithms and Computation

Finding frequent items in data streams

Proceedings of the VLDB Endowment
Counting Flows over Sliding Windows in High Speed Networks

NETWORKING '09 Proceedings of the 8th International IFIP-TC 6 Networking Conference
Methods for finding frequent items in data streams

The VLDB Journal — The International Journal on Very Large Data Bases
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases
InfoPuzzle: exploring group decision making in mobile peer-to-peer databases

Proceedings of the VLDB Endowment
HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm

Proceedings of the 16th International Conference on Extending Database Technology

Quantified Score

Hi-index	0.00

Visualization

Abstract

Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of primal importance, and businesses are willing to trade more resources for better accuracy. Throughout our experience with building a distinct count system at a major search engine, Ask.com, we reviewed the literature of approximating distinct counts, and compared most algorithms in the literature. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results highly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.