Tight results for clustering and summarizing data streams

Authors:
Sudipto Guha
Affiliations:
University of Pennsylvania, Philadelphia, PA
Venue:
Proceedings of the 12th International Conference on Database Theory
Year:
2009

Abstract

In this paper we investigate algorithms and lower bounds for summarization problems over a single-pass data stream. In particular, we focus on histogram construction and k-center clustering. We provide a simple framework that improves upon all previous algorithms for these problems in either the space bound, the approximation factor, or the running time. The framework uses a notion of "streamstrapping", in which summaries created for the initial prefixes of the data are used to develop better approximation algorithms. We also prove the first non-trivial lower bounds for these problems. We show that under the stricter requirement that an algorithm accurately approximate the error of every bucket or every cluster it produces, these upper bounds are almost the best possible. This property of accurate estimation holds for all known upper bounds on these problems.
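
To make the single-pass k-center setting concrete, the sketch below shows a simplified radius-doubling heuristic in Python. It is not the algorithm from this paper and carries no claimed approximation guarantee; it only illustrates the general idea of reusing a summary built on an earlier prefix of the stream (here, the current centers and radius guess) when processing later points. The function name `streaming_k_center`, the Euclidean metric, and the doubling rule are all assumptions made for illustration.

```python
import math


def streaming_k_center(stream, k):
    """Single-pass k-center sketch (illustrative, not the paper's method).

    Keeps at most k centers and a radius guess r. A new point is ignored
    if it lies within 2*r of an existing center; otherwise it becomes a
    center. When more than k centers exist, r is raised and centers that
    are within 2*r of an already kept center are dropped.
    """
    centers = []
    r = 0.0  # current radius guess; 0 until the first overflow
    for p in stream:
        # Point is already covered by the current summary: skip it.
        if any(math.dist(p, c) <= 2 * r for c in centers):
            continue
        centers.append(p)
        while len(centers) > k:
            if r == 0.0:
                # First overflow: initialise r from the closest pair.
                r = min(
                    math.dist(a, b)
                    for i, a in enumerate(centers)
                    for b in centers[i + 1:]
                ) / 2
            else:
                r *= 2  # raise the guess geometrically
            # Greedily keep centers that are pairwise more than 2*r apart.
            kept = []
            for c in centers:
                if all(math.dist(c, m) > 2 * r for m in kept):
                    kept.append(c)
            centers = kept
    return centers, r


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0)]
    print(streaming_k_center(points, k=2))
```

The summary uses memory proportional to k rather than to the stream length, which is the kind of space bound the abstract refers to; the specific constants here are illustrative only.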