Histogramming Data Streams with Fast Per-Item Processing

  • Authors:
  • Sudipto Guha; Piotr Indyk; S. Muthukrishnan; Martin Strauss

  • Venue:
  • ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
  • Year:
  • 2002

Abstract

A vector $A$ of length $N$ can be approximately represented by a histogram $H$, by writing $[0,N)$ as the non-overlapping union of $B$ intervals $I_j$, assigning a value $b_j$ to $I_j$, and approximating $A_i$ by $H_i = b_j$ for $i \in I_j$. An optimal histogram representation $H_{opt}$ consists of the choices of $I_j$ and $b_j$ that minimize the sum-square-error $\|A - H\|_2^2 = \sum_i |A_i - H_i|^2$. Numerous applications in statistics, signal processing and databases rely on histograms; typically $B$ is (significantly) smaller than $N$ and, hence, representing $A$ by $H$ yields substantial compression. We give a deterministic algorithm that approximates $H_{opt}$ and outputs a histogram $H$ such that $\|A - H\|_2^2 \le (1 + \epsilon) \|A - H_{opt}\|_2^2$. Our algorithm considers the data items $A_0, A_1, \ldots$ in order, i.e., in one pass, spends processing time $O(1)$ per item, uses total space $B \cdot \mathrm{poly}(\log(N), \log \|A\|, 1/\epsilon)$, and determines the histogram in time $\mathrm{poly}(B, \log(N), \log \|A\|, 1/\epsilon)$. Our algorithm is suitable for emerging applications where the signal is presented as a stream, the size of the signal is very large, and one must construct the histogram using significantly less space than the signal size. In particular, our algorithm is suited to high-performance needs where the per-item processing time must be minimized. Previous algorithms either used large space, i.e., $\Omega(N)$, or worked longer, i.e., $N \log^{\Omega(1)}(N)$ total time over the $N$ data items. Our algorithm is the first that simultaneously uses small space and runs fast, taking $O(1)$ worst-case time for per-item processing. In addition, our algorithm is quite simple.
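
To make the definition of the optimal histogram concrete, the following Python sketch computes H_opt exactly for a small in-memory vector. It uses the classical offline dynamic program over bucket boundaries, which reads the whole vector and takes O(N^2 B) time; it is not the one-pass, small-space algorithm described in the abstract, only an illustration of the objective that algorithm approximates. The function name `optimal_histogram` and its return format are assumptions made for this example.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, float]  # (start, end, value), covering start <= i < end


def optimal_histogram(a: List[float], b: int) -> Tuple[float, List[Bucket]]:
    """Exact minimum sum-square-error histogram of `a` with at most `b` buckets.

    Illustrative offline dynamic program (hypothetical helper, not the
    streaming algorithm from the paper).
    """
    n = len(a)
    if n == 0 or b <= 0:
        raise ValueError("need a non-empty vector and at least one bucket")
    b = min(b, n)  # more buckets than items cannot help

    # Prefix sums of values and of squared values.
    s = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    for i, x in enumerate(a):
        s[i + 1] = s[i] + x
        q[i + 1] = q[i] + x * x

    def sse(lo: int, hi: int) -> float:
        # Error of approximating a[lo:hi] by its mean, the best single value.
        total = s[hi] - s[lo]
        return max(0.0, (q[hi] - q[lo]) - total * total / (hi - lo))

    inf = float("inf")
    # err[k][i]: least error covering a[0:i] with exactly k buckets.
    err = [[inf] * (n + 1) for _ in range(b + 1)]
    cut = [[0] * (n + 1) for _ in range(b + 1)]
    err[0][0] = 0.0
    for k in range(1, b + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # the last bucket is [j, i)
                cand = err[k - 1][j] + sse(j, i)
                if cand < err[k][i]:
                    err[k][i] = cand
                    cut[k][i] = j

    # Walk the stored cut points backwards to recover the buckets.
    buckets: List[Bucket] = []
    i = n
    for k in range(b, 0, -1):
        j = cut[k][i]
        buckets.append((j, i, (s[i] - s[j]) / (i - j)))
        i = j
    buckets.reverse()
    return err[b][n], buckets


if __name__ == "__main__":
    error, buckets = optimal_histogram([1, 1, 1, 5, 5, 5, 9, 9], 3)
    print(error, buckets)
```

On the sample vector, three buckets recover the three constant runs with zero error. Using exactly `b` buckets loses nothing relative to "at most `b`", because splitting a bucket never increases the sum-square-error.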