Detecting and exploiting near-sortedness for efficient relational query evaluation

Authors:
Sagi Ben-Moshe;Yaron Kanza;Eldar Fischer;Arie Matsliah;Mani Fischer;Carl Staelin
Affiliations:
Technion;Technion;Technion;Technion;HP Labs;HP Labs
Venue:
Proceedings of the 14th International Conference on Database Theory
Year:
2011

Abstract

Many relational operations are best performed when the relations are stored sorted over the relevant attributes (e.g., the common attributes in a natural join operation). However, relations are generally not stored sorted because it is expensive to maintain them this way (and impossible whenever there is more than one relevant sort key). Still, relations often turn out to be nearly-sorted, where most tuples are close to their place in the order. This state can result from "leftover sortedness", where originally sorted relations were updated, or were combined into interim results when evaluating a complex query. It can also result from weak correlations between attribute values. Currently, nearly-sorted relations are treated the same as unsorted relations, and relational operations over them are evaluated with a generic algorithm. Yet many operations can be computed more efficiently by an algorithm that exploits this near-ordering. To benefit consistently from such algorithms, however, the system should also refrain from using the wrong algorithm for relations which happen not to be sorted at all. Thus, an efficient test is required, i.e., a very fast approximation algorithm for establishing whether a given relation is sufficiently nearly-sorted. In this paper, we provide the theoretical foundations for improving query evaluation over possibly nearly-sorted relations. First, we formally define what it means for a relation to be nearly-sorted, and show how operations over such relations, such as natural join, set operations and sorting, can be executed significantly more efficiently using an algorithm that we provide. If a relation is sufficiently nearly-sorted, then it can be sorted using two sequential reads of the relation, without writing intermediate data to disk. We then construct efficient probabilistic tests for approximating the degree of near-sortedness of a relation without having to read the entire file.
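To make the notion of near-sortedness concrete, the following minimal Python sketch works under one simple assumption that is not taken from the paper: every tuple lies at most `k` positions away from its place in the sorted order. Under that assumption, a min-heap holding `k + 1` entries produces sorted output in a single sequential pass with no intermediate writes. This is the standard bounded-displacement technique, not the algorithm the paper provides, and the names `sort_nearly_sorted` and `k` are illustrative.

```python
import heapq
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sort_nearly_sorted(items: Iterable[T], k: int) -> Iterator[T]:
    """Yield items in sorted order.

    Illustrative assumption: each item is at most k positions away
    from its position in the fully sorted sequence.
    """
    heap: List[T] = []
    for item in items:
        heapq.heappush(heap, item)
        # With more than k items buffered, the minimum is safe to emit:
        # under the displacement bound, no later item can precede it.
        if len(heap) > k:
            yield heapq.heappop(heap)
    # Drain whatever is still buffered, smallest first.
    while heap:
        yield heapq.heappop(heap)


if __name__ == "__main__":
    # Every value here is at most 2 positions from its sorted place.
    print(list(sort_nearly_sorted([3, 1, 2, 6, 4, 5], k=2)))
```

The buffer size is the only tuning knob: if the displacement bound does not actually hold for the input, the output of this sketch is not guaranteed to be sorted, which is one reason a preliminary test of near-sortedness matters.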
The role of our algorithms in a database management system setting is illustrated once the theoretical foundation is laid out. Finally, we outline factors that relate to practical implementations of our algorithms. We show how our test can be enhanced to provide an approximation rather than just a yes-no answer, and discuss its implementability in real-life scenarios where some sparseness may be present in the database files (e.g., if they were created using a B*-tree approach). We also show how our sort can benefit distributed systems and systems that use a solid-state drive, which may well become prevalent in the near future.
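The idea of judging near-sortedness from a sample, and of reporting a number rather than a yes-no answer, can be illustrated with a small Python sketch. It uses the classic binary-search spot-check: pick a random position, binary-search for its value as if the sequence were sorted, and see whether the search arrives back at that position. This is not the test constructed in the paper. It assumes distinct values and in-memory random access, and the function names are hypothetical. The reported fraction estimates the share of positions that fail the check, which can overstate how far the sequence is from sorted.

```python
import random
from typing import Optional, Sequence


def passes_search_check(values: Sequence[int], i: int) -> bool:
    """Binary-search for values[i] as if values were sorted and report
    whether the search lands on index i. Assumes distinct values."""
    lo, hi = 0, len(values) - 1
    target = values[i]
    while lo <= hi:
        mid = (lo + hi) // 2
        if values[mid] == target:
            return mid == i
        if values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False


def estimate_failing_fraction(
    values: Sequence[int],
    samples: int,
    rng: Optional[random.Random] = None,
) -> float:
    """Estimate the fraction of positions that fail the spot-check,
    reading only O(samples * log n) entries of the sequence."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    if not values:
        return 0.0
    rng = rng or random.Random()
    failures = 0
    for _ in range(samples):
        i = rng.randrange(len(values))
        if not passes_search_check(values, i):
            failures += 1
    return failures / samples


if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate_failing_fraction(list(range(1000)), 200, rng))
    print(estimate_failing_fraction(list(range(1000, 0, -1)), 200, rng))
```

A fully sorted sequence always yields 0.0, because every position passes the check. Positions that pass always form an increasing subsequence, so a low estimate suggests that most of the input is already in order.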