Shared query processing in data streaming systems

  • Authors:
  • Michael J. Franklin; Saileshwar Krishnamurthy

  • Affiliations:
  • University of California, Berkeley; University of California, Berkeley

  • Venue:
  • Shared query processing in data streaming systems
  • Year:
  • 2006

Abstract

In networked environments, there is an increased proliferation of sources (e.g., seismic sensors, financial tickers) that produce live data streams. As a consequence, systems that can manage streaming data have gained tremendous importance. These systems provide declarative query-based interfaces that have enabled new classes of applications that can react to live streaming data in real time. As such applications flourish, they result in large numbers of concurrent queries that data streaming systems have to support. The traditional approach of executing concurrent queries separately can lead to resource shortages that severely limit the usefulness of such systems. A better alternative is shared query processing, an approach where a system shares its resources to cooperatively process multiple concurrent queries by exploiting the similarities among these queries. Over the past two decades, there has been significant research on shared query processing that has typically led to approaches that optimize multiple concurrent queries in a static, batch-oriented fashion. This static approach is, however, unsuitable in real-world environments where queries join and leave the system in an unpredictable fashion. In this thesis, I reject the traditional methods of static shared query processing in favor of a more dynamic "on-the-fly" approach. In particular, I develop on-the-fly shared processing techniques for the following kinds of queries: (1) joins with varying predicates, (2) aggregates with varying windows, and (3) joins and aggregates with varying predicates and windows. The techniques developed in this thesis can be used to share both computation resources in single-site systems and communication resources in distributed systems. Furthermore, this thesis shows that systems that use these techniques can achieve significant improvements in scalability and performance.
For instance, shared computation was shown in experiments to enable a system to support between 8 and 16 times (i.e., roughly an order of magnitude) the number of concurrent queries that a system that uses existing unshared and shared approaches can support. Similarly, shared communication was shown in experiments to enable up to a 50% reduction in bandwidth consumption as compared to earlier techniques. In summary, this thesis advances the state of the art in two important ways. The first is to demonstrate that shared query processing can offer significant scalability improvements that are crucial in data streaming systems. The second is to show that on-the-fly approaches to sharing are feasible and can make shared query processing useful in real-world scenarios.