A linear-time probabilistic counting algorithm for database applications

Authors:
Kyu-Young Whang; Brad T. Vander-Zanden; Howard M. Taylor
Affiliations:
Korea Advanced Institute of Science and Technology, Seoul, Korea; Cornell Univ., Ithaca, NY; Univ. of Delaware, Newark
Venue:
ACM Transactions on Database Systems (TODS)
Year:
1990

Abstract

We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimate whose accuracy can be prespecified by the user, using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: a load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% error). We present this technique with two important applications to database problems: (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to be joined). These two parameters are important statistics that are used in relational query optimization and physical database design.
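The abstract does not spell out the estimator, so the following is only a minimal sketch of the hashing idea as it is commonly described: hash every value into a bitmap of m slots in a single pass, then estimate the number of unique values as -m * ln(V), where V is the fraction of slots left empty. The function name `linear_count`, the choice of BLAKE2b as the hash, and the one-byte-per-slot bitmap are assumptions made for illustration and are not taken from the paper.

```python
import hashlib
import math


def linear_count(values, m):
    """Estimate the number of distinct items in `values` using m bitmap slots.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    # One byte per slot keeps the sketch readable; a real bitmap would pack bits.
    bitmap = bytearray(m)
    for v in values:
        # Assumption: any well-mixing hash works; BLAKE2b is used for determinism.
        digest = hashlib.blake2b(repr(v).encode("utf-8"), digest_size=8).digest()
        bitmap[int.from_bytes(digest, "big") % m] = 1

    empty = m - sum(bitmap)
    if empty == 0:
        # With no empty slot the logarithm is undefined; a larger bitmap is needed.
        raise ValueError("bitmap is full; use a larger m")
    # Estimator: -m * ln(fraction of empty slots).
    return -m * math.log(empty / m)


if __name__ == "__main__":
    # 5000 values containing 1000 distinct ones, counted in a single pass.
    column = [i % 1000 for i in range(5000)]
    print(round(linear_count(column, m=4096)))
```

The pass touches each value once, which is where the O(q) running time comes from, and duplicates cost nothing extra because they set a slot that is already set. The bitmap size used here is deliberately generous; choosing m for a target accuracy at a high load factor is the subject of the paper's analysis and is not reproduced in this sketch.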