Optimizing away joins on data streams

  • Authors: Lukasz Golab; Theodore Johnson; Nick Koudas; Divesh Srivastava; David Toman

  • Affiliations: AT&T Labs - Research; AT&T Labs - Research; University of Toronto; AT&T Labs - Research; University of Waterloo

  • Venue: SSPS '08 Proceedings of the 2nd international workshop on Scalable stream processing system
  • Year: 2008

Abstract

Monitoring aggregates on network traffic streams is a compelling application of data stream management systems. Often, streaming aggregation queries involve joining multiple inputs (e.g., client requests and server responses) using temporal join conditions (e.g., within 5 seconds), followed by computation of aggregates (e.g., COUNT) over temporal windows (e.g., every 5 minutes). These types of queries help identify malfunctioning servers (missing responses), malicious clients (bursts of requests during a denial-of-service attack), or improperly configured protocols (short timeout intervals causing many retransmissions). However, while expressing such queries as joins is natural, evaluating them over massive data streams is inefficient. In this paper, we develop rewriting techniques for streaming aggregation queries that join multiple inputs. Our techniques identify conditions under which expensive joins can be optimized away, while providing error bounds for the results of the rewritten queries. The basis of the optimization is a powerful but decidable theory in which constraints over data streams can be formulated. We show the efficiency and accuracy of our solutions via experimental evaluation on real-life IP network data using the Gigascope stream processing engine.
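
The kind of query the abstract describes can be illustrated with a small Python sketch. This is not Gigascope code and not the paper's rewriting algorithm; the tuple layout `(timestamp_seconds, key)`, the function names, and the sample data are all assumptions made for illustration. It counts request/response pairs per 5-minute window with an explicit temporal join, then computes a join-free count of requests alone. The two agree here only because the sample data satisfies an assumed constraint: every request has exactly one response within 5 seconds.

```python
from collections import Counter

JOIN_WINDOW = 5    # seconds: a response must arrive within this bound of its request
AGG_WINDOW = 300   # seconds: aggregates are reported per 5-minute window


def joined_counts(requests, responses):
    """Count request/response pairs per window using an explicit temporal join.

    Each input is a list of (timestamp_seconds, key) tuples. The key is a
    hypothetical identifier that pairs a request with its response.
    """
    counts = Counter()
    for req_ts, req_key in requests:
        for resp_ts, resp_key in responses:
            if req_key == resp_key and 0 <= resp_ts - req_ts <= JOIN_WINDOW:
                # Assign the pair to the window containing the request.
                counts[req_ts // AGG_WINDOW] += 1
    return dict(counts)


def join_free_counts(requests):
    """Count requests per window without reading the response stream at all."""
    return dict(Counter(req_ts // AGG_WINDOW for req_ts, _ in requests))


# Hypothetical sample data in which every request is answered within 5 seconds.
requests = [(10, "a"), (20, "b"), (310, "c")]
responses = [(12, "a"), (24, "b"), (313, "c")]

print(joined_counts(requests, responses))
print(join_free_counts(requests))
```

When the assumed constraint does not hold (for example, a request with no response), the two counts diverge, which is why a rewrite of this kind needs both a condition under which it applies and a bound on the resulting error.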