We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. The algorithm runs in O(q) time, where q is the number of values including duplicates, and uses only a small amount of space to produce an estimate whose accuracy the user specifies in advance. Traditionally, accurate counts of unique values have been obtained by sorting, which takes O(q log q) time. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: a load factor (number of unique values divided by hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% error). We present the technique with two important applications to database problems: (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to be joined). Both parameters are important statistics used in relational query optimization and physical database design.
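The hashing idea described above can be illustrated with a minimal sketch. The abstract does not give the estimator, so the formula used here is an assumption: the estimator commonly associated with linear counting, n ≈ -m ln(V), where m is the bitmap size and V is the fraction of slots left empty after one pass. The function name `linear_count`, the choice of BLAKE2b as the hash, and the one-byte-per-slot bitmap are illustrative choices, not taken from the paper.

```python
import hashlib
import math


def linear_count(values, m):
    """Estimate the number of distinct items in `values` with an m-slot bitmap.

    Assumption: uses the estimator n ~ -m * ln(V), where V is the fraction
    of slots still empty after hashing every value once.
    """
    # One byte per slot for readability; a real bitmap would pack 8 slots per byte.
    bitmap = bytearray(m)
    for v in values:
        # Hash each value to a slot; duplicates always land on the same slot.
        digest = hashlib.blake2b(repr(v).encode("utf-8"), digest_size=8).digest()
        bitmap[int.from_bytes(digest, "big") % m] = 1
    empty = m - sum(bitmap)
    if empty == 0:
        # The estimator is undefined when no slot is empty.
        raise ValueError("bitmap is full; choose a larger m")
    return -m * math.log(empty / m)


if __name__ == "__main__":
    distinct = list(range(10_000))
    column = distinct * 3  # every value appears three times
    # m = 5000 gives a load factor of 2 (10,000 distinct values / 5,000 slots).
    print(round(linear_count(column, m=5_000)))
```

The pass is a single loop over the q input values, which matches the O(q) running time stated above, and the bitmap may be smaller than the number of distinct values, which is what a load factor above 1.0 means. The error for a given load factor depends on m; the sketch makes no attempt to reproduce the paper's accuracy tables.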