Histogramming Data Streams with Fast Per-Item Processing

Authors:
Sudipto Guha;Piotr Indyk;S. Muthukrishnan;Martin Strauss
Venue:
ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Year:
2002

Citing 8
Cited 28

The space complexity of approximating the frequency moments

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Deriving traffic demands for operational IP networks: methodology and experience

Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication
Data-streams and histograms

STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Optimal Histograms with Quality Guarantees

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries

Proceedings of the 27th International Conference on Very Large Data Bases
Clustering data streams

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Stable distributions, pseudorandom generators, embeddings and data stream computation

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data

ICDE '02 Proceedings of the 18th International Conference on Data Engineering

Rangesum histograms

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Issues in data stream management

ACM SIGMOD Record
Accuracy vs. lifetime: linear sketches for approximate aggregate range queries in sensor networks

Proceedings of the 2004 joint workshop on Foundations of mobile computing
Duplicate detection in click streams

WWW '05 Proceedings of the 14th international conference on World Wide Web
PADS: a domain-specific language for processing ad hoc data

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Wavelet synopsis for data streams: minimizing non-euclidean error

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Space efficiency in synopsis construction algorithms

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Approximation algorithms for wavelet transform coding of data streams

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Approximation and streaming algorithms for histogram construction problems

ACM Transactions on Database Systems (TODS)
PADS: an end-to-end system for processing ad hoc data

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
An integrated efficient solution for computing frequent and top-k elements in data streams

ACM Transactions on Database Systems (TODS)
Data streams: algorithms and applications

Foundations and Trends® in Theoretical Computer Science
Window join approximation over data streams with importance semantics

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
The history of histograms (abridged)

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
XWAVE: optimal and approximate extended wavelets

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
REHIST: relative error histogram construction algorithms

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Window query processing for joining data streams with relations

CASCON '07 Proceedings of the 2007 conference of the center for advanced studies on Collaborative research
Enhancing histograms by tree-like bucket indices

The VLDB Journal — The International Journal on Very Large Data Bases
Wavelet synopsis for hierarchical range queries with workloads

The VLDB Journal — The International Journal on Very Large Data Bases
Multiple-Objective Compression of Data Cubes in Cooperative OLAP Environments

ADBIS '08 Proceedings of the 12th East European conference on Advances in Databases and Information Systems
On the space-time of optimal, approximate and streaming algorithms for synopsis construction problems

The VLDB Journal — The International Journal on Very Large Data Bases
Multiplicative synopses for relative-error metrics

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Deterministically Estimating Data Stream Frequencies

COCOA '09 Proceedings of the 3rd International Conference on Combinatorial Optimization and Applications
Consistent histograms in the presence of distinct value counts

Proceedings of the VLDB Endowment
A top-down approach for compressing data cubes under the simultaneous evaluation of multiple hierarchical range queries

Journal of Intelligent Information Systems
Workload-optimal histograms on streams

ESA'05 Proceedings of the 13th annual European conference on Algorithms
Exploiting cluster analysis for constructing multi-dimensional histograms on both static and evolving data

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases

Abstract

A vector $A$ of length $N$ can be approximately represented by a histogram $H$, by writing $[0,N)$ as the non-overlapping union of $B$ intervals $I_j$, assigning a value $b_j$ to $I_j$, and approximating $A_i$ by $H_i = b_j$ for $i \in I_j$. An optimal histogram representation $H_{\mathrm{opt}}$ consists of the choices of $I_j$ and $b_j$ that minimize the sum-square-error $\|A - H\|_2^2 = \sum_i |A_i - H_i|^2$. Numerous applications in statistics, signal processing and databases rely on histograms; typically $B$ is (significantly) smaller than $N$ and, hence, representing $A$ by $H$ yields substantial compression.

We give a deterministic algorithm that approximates $H_{\mathrm{opt}}$ and outputs a histogram $H$ such that $\|A - H\|_2^2 \le (1 + \epsilon)\,\|A - H_{\mathrm{opt}}\|_2^2$. Our algorithm considers the data items $A_0, A_1, \ldots$ in order, i.e., in one pass, spends processing time $O(1)$ per item, uses total space $B \cdot \mathrm{poly}(\log N, \log \|A\|, 1/\epsilon)$, and determines the histogram in time $\mathrm{poly}(B, \log N, \log \|A\|, 1/\epsilon)$. Our algorithm is suitable to emerging applications where the signal is presented in a stream, the size of the signal is very large, and one must construct the histogram using significantly smaller space than the signal size. In particular, our algorithm is suited to high performance needs where the per-item processing time must be minimized. Previous algorithms either used large space, i.e., $\Omega(N)$, or worked longer, i.e., $N \log^{\Omega(1)}(N)$ total time over the $N$ data items. Our algorithm is the first that simultaneously uses small space as well as runs fast, taking $O(1)$ worst case time for per-item processing. In addition, our algorithm is quite simple.
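
To make the objective in the abstract concrete, the following Python sketch computes the sum-square-error of a B-bucket histogram and finds an exact optimal histogram with the textbook offline dynamic program. It is not the paper's one-pass algorithm: it stores the whole vector and takes O(N^2 B) time, which is the kind of cost the streaming result avoids. The function names `sse_of_histogram` and `optimal_histogram` and the boundary-list representation are assumptions made for illustration only.

```python
from itertools import accumulate


def sse_of_histogram(a, boundaries, values):
    """Sum-square-error of a histogram against the vector a.

    boundaries holds the B+1 bucket endpoints 0 = e_0 < ... < e_B = N;
    bucket j covers indices [e_j, e_{j+1}) and is assigned values[j].
    """
    total = 0.0
    for j, b in enumerate(values):
        for i in range(boundaries[j], boundaries[j + 1]):
            total += (a[i] - b) ** 2
    return total


def optimal_histogram(a, num_buckets):
    """Exact optimal histogram via offline dynamic programming.

    Illustrative baseline only (O(N^2 B) time, O(N B) space), assuming
    1 <= num_buckets <= len(a). Returns (boundaries, values, error).
    """
    n = len(a)
    prefix = [0.0] + list(accumulate(a))
    prefix_sq = [0.0] + list(accumulate(x * x for x in a))

    def bucket_cost(lo, hi):
        # SSE of a[lo:hi] when approximated by its mean, the best constant.
        s = prefix[hi] - prefix[lo]
        return (prefix_sq[hi] - prefix_sq[lo]) - s * s / (hi - lo)

    inf = float("inf")
    # best[k][i]: minimum SSE for a[0:i] using exactly k buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cost = best[k - 1][j] + bucket_cost(j, i)
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the boundaries.
    boundaries = [n]
    i = n
    for k in range(num_buckets, 0, -1):
        i = cut[k][i]
        boundaries.append(i)
    boundaries.reverse()
    values = [
        (prefix[hi] - prefix[lo]) / (hi - lo)
        for lo, hi in zip(boundaries, boundaries[1:])
    ]
    return boundaries, values, best[num_buckets][n]


if __name__ == "__main__":
    signal = [1, 1, 1, 5, 5, 9, 9, 9]
    print(optimal_histogram(signal, 3))
```

Each bucket takes the mean of its interval because the mean is the constant that minimizes squared error over a fixed interval; the dynamic program then only has to choose the interval endpoints.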