Space complexity of hierarchical heavy hitters in multi-dimensional data streams

  • Authors:
  • John Hershberger (Mentor Graphics Corp., Wilsonville, OR)
  • Nisheeth Shrivastava (University of California at Santa Barbara, Santa Barbara, CA)
  • Subhash Suri (University of California at Santa Barbara, Santa Barbara, CA)
  • Csaba D. Tóth (MIT, Cambridge, MA)

  • Venue:
  • Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '05)
  • Year:
  • 2005

Abstract

Heavy hitters, which are items occurring with frequency above a given threshold, are an important aggregation and summary tool when processing data streams or data warehouses. Hierarchical heavy hitters (HHHs) have been introduced as a natural generalization for hierarchical data domains, including multi-dimensional data. An item x in a hierarchy is called a φ-HHH if its frequency, after discounting the frequencies of all its descendant hierarchical heavy hitters, exceeds φn, where φ is a user-specified parameter and n is the size of the data set. Recently, single-pass schemes have been proposed for computing φ-HHHs using space roughly O((1/φ) log(φn)). The frequency estimates of these algorithms, however, hold only for the total frequencies of items, and not the discounted frequencies; this leads to false positives because the discounted frequency can be significantly smaller than the total frequency. This paper attempts to explain the difficulty of finding hierarchical heavy hitters with better accuracy. We show that a single-pass deterministic scheme that computes φ-HHHs in a d-dimensional hierarchy with any approximation guarantee must use Ω(1/φ^{d+1}) space. This bound is tight: in fact, we present a data stream algorithm that can report the φ-HHHs without false positives in O(1/φ^{d+1}) space.
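To make the discounted-frequency definition concrete, the following is a minimal offline sketch (not the paper's streaming algorithm) that computes exact φ-HHHs on a one-dimensional hierarchy. The path-string labels and the `parent` helper are illustrative assumptions, not notation from the paper; the point is the bottom-up discounting step, where a node's count is only compared against φn after subtracting the counts already claimed by its descendant HHHs.

```python
from collections import Counter

def phi_hhh(items, phi, parent):
    """Exact (offline) phi-HHHs on a 1-D hierarchy.

    items:  iterable of leaf labels
    phi:    threshold in (0, 1]
    parent: maps a label to its parent label, or None at the root

    A node is a phi-HHH if its frequency, after discounting the
    frequencies of all its descendant HHHs, exceeds phi * n.
    """
    n = 0
    freq = Counter()  # total frequency of every node on each leaf's root path
    for leaf in items:
        n += 1
        node = leaf
        while node is not None:
            freq[node] += 1
            node = parent(node)

    def depth(node):
        d = 0
        while parent(node) is not None:
            node, d = parent(node), d + 1
        return d

    # Process nodes deepest-first so descendant HHHs are decided before
    # their ancestors, then discount each new HHH's count from all ancestors.
    discounted = dict(freq)
    hhhs = set()
    for node in sorted(freq, key=depth, reverse=True):
        if discounted[node] > phi * n:
            hhhs.add(node)
            anc = parent(node)
            while anc is not None:
                discounted[anc] -= discounted[node]
                anc = parent(anc)
    return hhhs

# Hypothetical example hierarchy: 'a/x' -> 'a' -> '' (root of ''); e.g. IP
# prefixes or product categories would work the same way.
def parent(label):
    if label == "":
        return None
    return label.rsplit("/", 1)[0] if "/" in label else ""
```

For instance, with 5 copies of "a/x", 4 of "a/y", and 2 of "b/z" (n = 11), at φ = 0.35 the leaves "a/x" and "a/y" are HHHs and "a" is not, because its discounted count drops to 0; at φ = 0.5 neither leaf qualifies, so "a" keeps its full count of 9 and becomes the only HHH. This is exactly the effect the abstract highlights: the discounted frequency of a node can be far smaller than its total frequency.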