Tight lower bounds for query processing on streaming and external memory data

  • Authors:
  • Martin Grohe;Christoph Koch;Nicole Schweikardt

  • Affiliations:
  • Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin, Germany;Database Group, Universität des Saarlandes, Postfach 15 11 50, D-66041 Saarbrücken,Germany;Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin, Germany

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2007

Quantified Score

Hi-index 5.23

Visualization

Abstract

It is generally assumed that databases have to reside in external, inexpensive storage because of their sheer size. Current technology for external storage systems presents us with a reality that, performance-wise, a small number of sequential scans of the data is strictly preferable over random data accesses. Database technology-in particular query processing technology-has developed around a notion of memory hierarchies with layers of greatly varying sizes and access times. It seems that the current technologies scale up to their tasks and are very successful, but on closer investigation it may appear that our theoretical understanding of the problems involved-and of optimal algorithms for these problems-is not quite as developed. Recently, data stream processing has become an object of study by the database management community, but from the viewpoint of database theory, this is really a special case of the query processing problem on data in external storage where we are limited to a single scan of the input data. In the present paper we study a clean machine model for external memory and stream processing. We establish tight bounds for the data complexity of Core XPath evaluation and filtering. We show that the number of scans of the external data induces a strict hierarchy (as long as internal memory space is sufficiently small, e.g., polylogarithmic in the size of the input). We also show that neither joins nor sorting are feasible if the product of the number r(n) of scans of the external memory and the size s(n) of the internal memory buffers is sufficiently small, i.e., of size o(n).