Fast string sorting using order-preserving compression

Authors:
Alejandro López-Ortiz;Mehdi Mirzazadeh;Mohammad Ali Safari;Hossein Sheikhattar
Affiliations:
University of Waterloo, Ont., Canada;University of Waterloo, Ont., Canada;University of British Columbia, Vancouver, B.C., Canada;University of Waterloo, Ont., Canada
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
2005

Abstract

We give experimental evidence for the benefits of order-preserving compression in sorting algorithms. While, in general, any algorithm might benefit from compressed data because of reduced paging requirements, we identified two natural candidates that would further benefit from order-preserving compression, namely string-oriented sorting algorithms and word-RAM algorithms for keys of bounded length. The word-RAM model has some of the fastest known sorting algorithms in practice. These algorithms are designed for keys of bounded length, usually 32 or 64 bits, which limits their direct applicability to strings. One possibility is to use an order-preserving compression scheme, so that a bounded-key-length algorithm can be applied. For the case of standard algorithms, we took what is considered to be among the fastest non-word-RAM string sorting algorithms, Fast MKQSort, and measured its performance on compressed data. The Fast MKQSort algorithm of Bentley and Sedgewick is optimized to handle text strings. Our experiments show that order-preserving compression techniques result in savings of approximately 15% over the same algorithm on noncompressed data. For the word-RAM, we modified Andersson's sorting algorithm to handle variable-length keys. The resulting algorithm is faster than the standard Unix sort by a factor of 1.5. Last, we used an order-preserving scheme that is within a constant additive term of the optimal Hu-Tucker code, but requires linear time rather than O(m log m), where m = |Σ| is the size of the alphabet.
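
To make the notion of order-preserving compression concrete, the following is a minimal Python sketch, not the scheme used in the paper. It assumes a simple weight-balanced splitting heuristic to build an alphabetic prefix code: it is neither the Hu-Tucker algorithm nor the authors' linear-time method, and the names `build_alphabetic_code` and `encode` are hypothetical. The point it illustrates is only that comparing the encoded bit strings gives the same order as comparing the original strings.

```python
from collections import Counter


def build_alphabetic_code(weights):
    """Return {symbol: bit string} for (symbol, weight) pairs sorted by symbol.

    Hypothetical helper: recursively cuts the sorted alphabet where the two
    halves have the most equal total weight. The left half gets '0' and the
    right half '1', so codewords are prefix-free and increase with the symbols.
    """
    codes = {}

    def split(lo, hi, prefix):
        if hi - lo == 1:
            # A one-symbol alphabet still needs a non-empty codeword.
            codes[weights[lo][0]] = prefix or "0"
            return
        total = sum(w for _, w in weights[lo:hi])
        running, best_cut, best_gap = 0, lo + 1, None
        for cut in range(lo + 1, hi):
            running += weights[cut - 1][1]
            gap = abs(total - 2 * running)  # imbalance between the two halves
            if best_gap is None or gap < best_gap:
                best_cut, best_gap = cut, gap
        split(lo, best_cut, prefix + "0")
        split(best_cut, hi, prefix + "1")

    if weights:
        split(0, len(weights), "")
    return codes


def encode(text, codes):
    """Concatenate the codewords of the characters in text."""
    return "".join(codes[ch] for ch in text)


# Illustrative data only; symbol weights are taken from the strings themselves.
strings = ["banana", "band", "apple", "ban", "cherry", "apricot", ""]
freq = Counter("".join(strings))
codes = build_alphabetic_code(sorted(freq.items()))

# Sorting by the compressed bit strings reproduces the plain string order.
by_compressed_key = sorted(strings, key=lambda s: encode(s, codes))
```

The sketch keeps the codewords as text strings of '0' and '1' for readability. A word-RAM sorter would additionally need those bits packed into machine words, which is not shown here.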