Approximation and streaming algorithms for histogram construction problems

Authors:
Sudipto Guha;Nick Koudas;Kyuseok Shim
Affiliations:
University of Pennsylvania, Philadelphia, PA;University of Toronto, Ont, Canada;Seoul National University, Seoul, Korea
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2006

Abstract

Histograms and related synopsis structures are popular techniques for approximating data distributions. These have been successful in query optimization and a variety of applications, including approximate querying, similarity searching, and data mining, to name a few. Histograms were among the earliest synopsis structures proposed and continue to be used widely. The histogram construction problem is to construct the best histogram restricted to a space bound that reflects the data distribution most accurately under a given error measure.

The histograms are used as quick and easy estimates. Thus, a slight loss of accuracy, compared to the optimal histogram under the given error measure, can be offset by fast histogram construction algorithms. A natural question arises in this context: Can we find a fast, near-optimal approximation algorithm for the histogram construction problem? In this article, we give the first linear-time (1+ε)-factor approximation algorithms (for any ε > 0) for a large number of histogram construction problems, including the use of piecewise small-degree polynomials to approximate data, workloads, etc. Several of our algorithms extend to data streams.

Using synthetic and real-life data sets, we demonstrate that in many scenarios the approximate histograms are almost identical to optimal histograms in quality and are significantly faster to construct.
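
To make the problem statement in the abstract concrete, the following is a minimal sketch of one instance of it: choosing at most B contiguous buckets, each represented by its mean, so that the sum of squared errors is minimized. It uses the textbook exact dynamic program, which takes time quadratic in the input length. It is not the linear-time (1+ε)-approximation the article describes; it only illustrates the kind of optimal baseline such approximations are compared against. The function name `v_optimal_histogram`, the choice of sum of squared errors as the error measure, and the sample data are assumptions made for illustration.

```python
from itertools import accumulate


def v_optimal_histogram(data, num_buckets):
    """Exact minimum sum-of-squared-error histogram (illustrative baseline).

    Returns (total_error, buckets), where each bucket is a tuple
    (start, end, mean) covering data[start:end].
    Runs in O(n^2 * B) time; this is NOT the article's approximation algorithm.
    """
    n = len(data)
    if n == 0 or num_buckets < 1:
        raise ValueError("need non-empty data and at least one bucket")
    b = min(num_buckets, n)

    # Prefix sums let us compute the error of any bucket in O(1).
    prefix = [0.0] + list(accumulate(data))
    prefix_sq = [0.0] + list(accumulate(x * x for x in data))

    def sse(i, j):
        """Squared error of representing data[i:j] by its mean (j > i)."""
        s = prefix[j] - prefix[i]
        # Clamp tiny negative values caused by floating-point rounding.
        return max(0.0, (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i))

    inf = float("inf")
    # cost[k][j]: least error covering the first j values with exactly k buckets.
    cost = [[inf] * (n + 1) for _ in range(b + 1)]
    back = [[0] * (n + 1) for _ in range(b + 1)]
    cost[0][0] = 0.0
    for k in range(1, b + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # last bucket is data[i:j]
                c = cost[k - 1][i] + sse(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    back[k][j] = i

    # Walk the back-pointers to recover the bucket boundaries.
    buckets = []
    j = n
    for k in range(b, 0, -1):
        i = back[k][j]
        buckets.append((i, j, (prefix[j] - prefix[i]) / (j - i)))
        j = i
    buckets.reverse()
    return cost[b][n], buckets


if __name__ == "__main__":
    sample = [1, 1, 1, 5, 5, 5, 9, 9]  # hypothetical input
    error, buckets = v_optimal_histogram(sample, 3)
    print(error, buckets)
```

On the sample input, three buckets suffice to represent the data with zero error, while a single bucket cannot. The cubic-looking triple loop is what makes exact construction expensive on large inputs, which is the motivation the abstract gives for faster approximate construction.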