Detecting and exploiting near-sortedness for efficient relational query evaluation

  • Authors:
  • Sagi Ben-Moshe;Yaron Kanza;Eldar Fischer;Arie Matsliah;Mani Fischer;Carl Staelin

  • Affiliations:
  • Technion;Technion;Technion;Technion;HP Labs;HP Labs

  • Venue:
  • Proceedings of the 14th International Conference on Database Theory
  • Year:
  • 2011

Quantified Score

Hi-index 0.01

Visualization

Abstract

Many relational operations are best performed when the relations are stored sorted over the relevant attributes (e.g. the common attributes in a natural join operation). However, generally relations are not stored sorted because it is expensive to maintain them this way (and impossible whenever there is more than one relevant sort key). Still, many times relations turn out to be nearly-sorted, where most tuples are close to their place in the order. This state can result from "leftover sortedness", where originally sorted relations were updated, or were combined into interim results when evaluating a complex query. It can also result from weak correlations between attribute values. Currently, nearly-sorted relations are treated the same as unsorted relations, and when relational operations are evaluated for them, a generic algorithm is used. Yet, many operations can be computed more efficiently by an algorithm that exploits this near-ordering. However, to consistently benefit from using such algorithms the system should also refrain from using the wrong algorithm for relations which happen not to be sorted at all. Thus, an efficient test is required, i.e., a very fast approximation algorithm for establishing whether a given relation is sufficiently nearly-sorted. In this paper, we provide the theoretical foundations for improving query evaluation over possibly nearly-sorted relations. First we formally define what it means for a relation to be nearly-sorted, and show how operations over such relations, such as natural join, set operations and sorting, can be executed significantly more efficiently using an algorithm that we provide. If a relation is nearly-sorted enough, then it can be sorted using two sequential reads of the relation, and writing no intermediate data to disk. We then construct efficient probabilistic tests for approximating the degree of the near-sortedness of a relation without having to read an entire file. The role of our algorithms in a database management system setting is illustrated as soon as the theoretical foundation is laid out. Finally, we outline factors that relate to practical implementations of our algorithms. We show how our test can be enhanced to provide an approximation rather than just a yes-no answer, and discuss its implementability in reallife scenarios where some sparseness may be present in the database files (e.g. if they were created using a B*-tree approach). We also show how our sort can benefit distributed systems and systems that use a solid-state drive, which may very well become prevalent in the near future.