An Approximate L1-Difference Algorithm for Massive Data Streams

  • Authors: Joan Feigenbaum; Sampath Kannan; Martin J. Strauss; Mahesh Viswanathan

  • Venue: SIAM Journal on Computing

  • Year: 2003

Abstract

Massive data sets are increasingly important in a wide range of applications, including observational sciences, product marketing, and the monitoring and operations of large systems. In network operations, raw data typically arrive in streams, and decisions must be made by algorithms that make one pass over each stream, throw much of the raw data away, and produce "synopses" or "sketches" for further processing. Moreover, network-generated massive data sets are often distributed: Several different, physically separated network elements may receive or generate data streams that, together, comprise one logical data set; to be of use in operations, the streams must be analyzed locally and their synopses sent to a central operations facility. The enormous scale, distributed nature, and one-pass processing requirement on the data sets of interest must be addressed with new algorithmic techniques.

We present one fundamental new technique here: a space-efficient, one-pass algorithm for approximating the L1-difference $\sum_i|a_i-b_i|$ between two functions, when the function values $a_i$ and $b_i$ are given as data streams, and their order is chosen by an adversary. Our main technical innovation, which may be of interest outside the realm of massive data stream algorithmics, is a method of constructing families $\{V_j(s)\}$ of limited-independence random variables that are range-summable, by which we mean that $\sum_{j=0}^{c-1} V_j(s)$ is computable in time polylog(c) for all seeds s. Our L1-difference algorithm can be viewed as a "sketching" algorithm, in the sense of [Broder et al., J. Comput. System Sci., 60 (2000), pp. 630--659], and our technique performs better than that of Broder et al. when used to approximate the symmetric difference of two sets with small symmetric difference.
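
As an illustration of how range sums of ±1 variables can yield an L1-difference estimate, here is a minimal Python sketch. It is not the construction from the paper, and the abstract does not spell out the estimator. Everything named here is an assumption made for illustration: the helpers `v`, `range_sum`, `sketch`, and `estimate_l1` are hypothetical, the ±1 values come from a hash function rather than a limited-independence family, the range sum is evaluated naively in O(c) time rather than polylog(c), and the estimate is a plain mean over seeds. It also assumes each index i appears exactly once per stream as a pair (i, value) with a nonnegative integer value.

```python
import hashlib
from statistics import mean


def exact_l1(stream_a, stream_b):
    """Exact baseline: stores one entry per index, so space grows with the data."""
    diff = {}
    for i, value in stream_a:
        diff[i] = diff.get(i, 0) + value
    for i, value in stream_b:
        diff[i] = diff.get(i, 0) - value
    return sum(abs(d) for d in diff.values())


def v(seed, i, j):
    """Hypothetical +/-1 variable for (seed, i, j); a hash-based stand-in,
    not the paper's limited-independence construction."""
    digest = hashlib.blake2b(f"{seed}:{i}:{j}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


def range_sum(seed, i, c):
    """Naive O(c) evaluation of sum_{j=0}^{c-1} v(seed, i, j).
    The paper's contribution is doing this step in polylog(c) time."""
    return sum(v(seed, i, j) for j in range(c))


def sketch(stream, seeds):
    """One pass over the stream; keeps one counter per seed.
    The result does not depend on the order in which items arrive."""
    counters = [0] * len(seeds)
    for i, value in stream:
        for k, s in enumerate(seeds):
            counters[k] += range_sum(s, i, value)
    return counters


def estimate_l1(sketch_a, sketch_b):
    """For each seed, the counter difference is a signed sum of |a_i - b_i|
    variables per index, so its square has expectation sum_i |a_i - b_i|
    when the variables are pairwise independent. Average over seeds."""
    return mean((za - zb) ** 2 for za, zb in zip(sketch_a, sketch_b))


# Two streams describing functions a and b, delivered in different orders.
a = [(0, 5), (1, 2), (2, 9), (3, 4)]
b = [(3, 4), (2, 6), (0, 7), (1, 2)]
seeds = range(400)

print("exact:", exact_l1(a, b))
print("estimate:", estimate_l1(sketch(a, seeds), sketch(b, seeds)))
```

Each stream is summarized independently, and only the two short counter lists need to be brought together, which matches the distributed setting described in the abstract. The naive `range_sum` makes per-item cost proportional to the value itself; replacing it with a polylog-time evaluation is exactly what range-summability provides.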