Block-level sampling is far more efficient than true uniform-random sampling over a large database, but it is prone to significant errors when used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of block-level sampling for histograms as well as for distinct-value estimation. For histogram construction, we give a novel two-phase adaptive method in which the sample size required to reach a desired accuracy is decided based on a first-phase sample. This method is significantly faster than previous iterative methods proposed for the same problem. For distinct-value estimation, we show that existing estimators designed for uniform-random samples may perform very poorly if used directly on block-level samples. We present a key technique that computes an appropriate subset of a block-level sample that is suitable for use with most existing estimators. To the best of our knowledge, this is the first principled method for distinct-value estimation with block-level samples. We provide extensive experimental results validating our methods.
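The claim that estimators designed for uniform-random samples can fail on block-level samples can be illustrated with a small Python sketch. It is not the method of the paper. It assumes a toy table whose rows are fully clustered by value, with each value filling exactly one block, and it uses the GEE estimator, sqrt(N/n) * f1 + sum of f_j for j >= 2, as one example of an existing estimator. The helper names, table sizes, and seed are all hypothetical.

```python
import math
import random
from collections import Counter


def gee_estimate(sample, table_size):
    """GEE distinct-value estimate: sqrt(N/n) * f1 + sum_{j>=2} f_j,
    where f_j is the number of values seen exactly j times in the sample."""
    n = len(sample)
    value_counts = Counter(sample)
    freq_of_freq = Counter(value_counts.values())
    f1 = freq_of_freq.get(1, 0)
    repeated = sum(count for j, count in freq_of_freq.items() if j >= 2)
    return math.sqrt(table_size / n) * f1 + repeated


def uniform_row_sample(table, n, rng):
    """Draw n rows uniformly at random without replacement."""
    return rng.sample(table, n)


def block_level_sample(table, block_size, num_blocks, rng):
    """Draw num_blocks whole blocks at random and return all their rows."""
    total_blocks = len(table) // block_size
    chosen = rng.sample(range(total_blocks), num_blocks)
    rows = []
    for b in chosen:
        rows.extend(table[b * block_size:(b + 1) * block_size])
    return rows


# Toy table (assumption): 1000 distinct values, 100 copies each, stored
# in value order so that every block of 100 rows holds a single value.
DISTINCT, COPIES, BLOCK_SIZE = 1000, 100, 100
table = [v for v in range(DISTINCT) for _ in range(COPIES)]

rng = random.Random(0)
block_rows = block_level_sample(table, BLOCK_SIZE, 50, rng)
uniform_rows = uniform_row_sample(table, len(block_rows), rng)

block_est = gee_estimate(block_rows, len(table))
uniform_est = gee_estimate(uniform_rows, len(table))

print(f"true distinct values:      {DISTINCT}")
print(f"GEE on block-level sample: {block_est:.0f}")
print(f"GEE on uniform row sample: {uniform_est:.0f}")
```

In this layout every sampled block contributes one value repeated 100 times, so the block-level sample contains no singleton values and the estimate collapses to the number of sampled blocks. The same number of rows drawn uniformly touches almost every value. Real tables are rarely clustered this perfectly, so the gap shown here is a worst-case illustration rather than a typical result.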