A vector $A$ of length $N$ can be approximately represented by a histogram $H$ by writing $[0,N)$ as the non-overlapping union of $B$ intervals $I_j$, assigning a value $b_j$ to $I_j$, and approximating $A_i$ by $H_i = b_j$ for $i \in I_j$. An optimal histogram representation $H_{opt}$ consists of the choices of $I_j$ and $b_j$ that minimize the sum-square-error $\|A - H\|_2^2 = \sum_i |A_i - H_i|^2$. Numerous applications in statistics, signal processing and databases rely on histograms; typically $B$ is (significantly) smaller than $N$ and, hence, representing $A$ by $H$ yields substantial compression.

We give a deterministic algorithm that approximates $H_{opt}$ and outputs a histogram $H$ such that $\|A - H\|_2^2 \le (1 + \epsilon) \|A - H_{opt}\|_2^2$. Our algorithm considers the data items $A_0, A_1, \ldots$ in order, i.e., in one pass, spends processing time $O(1)$ per item, uses total space $B \cdot \mathrm{poly}(\log N, \log \|A\|, 1/\epsilon)$, and determines the histogram in time $\mathrm{poly}(B, \log N, \log \|A\|, 1/\epsilon)$. Our algorithm is suited to emerging applications where the signal is presented as a stream, the size of the signal is very large, and one must construct the histogram using significantly less space than the signal size. In particular, our algorithm is suited to high-performance needs where the per-item processing time must be minimized. Previous algorithms either used large space, i.e., $\Omega(N)$, or worked longer, i.e., $N \log^{\Omega(1)}(N)$ total time over the $N$ data items. Our algorithm is the first that simultaneously uses small space and runs fast, taking $O(1)$ worst-case time per item. In addition, our algorithm is quite simple.
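To make the definition of $H_{opt}$ concrete, the following is a minimal sketch of the textbook offline dynamic program for the optimal $B$-bucket histogram under sum-square-error. It is not the one-pass algorithm described above: it stores the whole vector and takes $O(N^2 B)$ time, so it only illustrates the quantity that the streaming algorithm approximates to within a $(1 + \epsilon)$ factor. The function name `optimal_histogram` and the `(start, end, value)` bucket format are assumptions made for this illustration.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, float]  # (start, end, value), covering start <= i < end


def optimal_histogram(a: List[float], b: int) -> Tuple[float, List[Bucket]]:
    """Return (sum-square-error, buckets) of an optimal histogram with b buckets.

    Offline baseline only: uses Theta(N * B) space and O(N^2 * B) time.
    """
    n = len(a)
    if n == 0 or b <= 0:
        raise ValueError("need a non-empty vector and at least one bucket")
    b = min(b, n)  # more buckets than items cannot help

    # Prefix sums of values and of squares give any interval's error in O(1).
    s = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    for i, x in enumerate(a):
        s[i + 1] = s[i] + x
        q[i + 1] = q[i] + x * x

    def sse(lo: int, hi: int) -> float:
        # Error of approximating a[lo:hi] by its mean, the best constant b_j.
        total = s[hi] - s[lo]
        return max(0.0, (q[hi] - q[lo]) - total * total / (hi - lo))

    inf = float("inf")
    # err[k][i]: least error when a[0:i] is covered by exactly k buckets.
    err = [[inf] * (n + 1) for _ in range(b + 1)]
    cut = [[0] * (n + 1) for _ in range(b + 1)]
    err[0][0] = 0.0
    for k in range(1, b + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # last bucket is a[j:i]
                cand = err[k - 1][j] + sse(j, i)
                if cand < err[k][i]:
                    err[k][i] = cand
                    cut[k][i] = j

    # Walk the stored cut points backwards to recover the intervals I_j.
    buckets: List[Bucket] = []
    i = n
    for k in range(b, 0, -1):
        j = cut[k][i]
        buckets.append((j, i, (s[i] - s[j]) / (i - j)))
        i = j
    buckets.reverse()
    return err[b][n], buckets
```

For example, `optimal_histogram([1, 1, 1, 5, 5, 5, 9, 9], 3)` recovers the three constant runs with zero error, while `optimal_histogram([0, 2], 1)` returns a single bucket with value 1.0 and error 2.0. The mean is used as each bucket value because it minimizes the squared error over an interval.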