A super scalar sort algorithm for RISC processors

Authors:
Ramesh C. Agarwal
Affiliations:
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY
Venue:
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Year:
1996

References (11):
A measure of transaction processing power. Datamation.
Parallel sorting methods for large data volumes on a hypercube database computer. Database Machines Sixth International Workshop, IWDM '89.
Sorting large data files on POOMA. CONPAR 90 Proceedings of the joint international conference on Vector and parallel processing.
AlphaSort: a RISC machine sort. SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data.
Characterization of alpha AXP performance using TP and SPEC workloads. ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture.
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM Journal of Research and Development.
High-performance parallel implementations of the NAS kernel benchmarks on the IBM SP2. IBM Systems Journal.
Parallel sorting on a shared-nothing architecture using probabilistic splitting. PDIS '91 Proceedings of the first international conference on Parallel and distributed information systems.
The Art of Computer Programming Volumes 1-3 Boxed Set.
Benchmark Handbook: For Database and Transaction Processing Systems.
AlphaSort: a cache-sensitive parallel external sort. The VLDB Journal: The International Journal on Very Large Data Bases.

Cited by (19):
High-performance sorting on networks of workstations. SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data.
Searching for the sorting record: experiences in tuning NOW-Sort. SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools.
Active disks: programming model, algorithms and evaluation. Proceedings of the eighth international conference on Architectural support for programming languages and operating systems.
Efficient bundle sorting. SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms.
Information and control in gray-box systems. SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles.
Cost effectiveness of an adaptable computing cluster. Proceedings of the 2001 ACM/IEEE conference on Supercomputing.
Scalable Sweeping-Based Spatial Join. VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases.
OLAP Query Processing Algorithm Based on Relational Storage. WAIM '02 Proceedings of the Third International Conference on Advances in Web-Age Information Management.
Asynchronous parallel disk sorting. Proceedings of the fifteenth annual ACM symposium on Parallel algorithms and architectures.
An Analysis of the Cost Effectiveness of an Adaptable Computing Cluster. Cluster Computing.
Fast and approximate stream mining of quantiles and frequencies using graphics processors. Proceedings of the 2005 ACM SIGMOD international conference on Management of data.
Reducing Server Data Traffic Using a Hierarchical Computation Model. IEEE Transactions on Parallel and Distributed Systems.
Implementing sorting in database systems. ACM Computing Surveys (CSUR).
GPUTeraSort: high performance graphics co-processor sorting for large database management. Proceedings of the 2006 ACM SIGMOD international conference on Management of data.
Sequential in-core sorting performance for a SQL data service and for parallel sorting on heterogeneous clusters. Future Generation Computer Systems - Systems performance analysis and evaluation.
An experimental study of sorting and branch prediction. Journal of Experimental Algorithmics (JEA).
Algorithms for memory hierarchies: advanced lectures.
The effect of local sort on parallel sorting algorithms. EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing.
Redesigning the string hash table, burst trie, and BST to exploit cache. Journal of Experimental Algorithmics (JEA).

Abstract

The compare-and-branch sequences required by a traditional sort algorithm cannot efficiently exploit the multiple execution units present in currently available high-performance RISC processors. This is because of the long latency of the compare instructions and the sequential nature of the algorithms used in sorting. With the increasing level of integration on a chip, this trend is expected to continue. We have developed new sort algorithms that eliminate almost all the compares, provide functional parallelism that can be exploited by multiple execution units, significantly reduce the number of passes through the keys, and improve data locality. These new algorithms outperform traditional sort algorithms by a large factor.

For the Datamation disk-to-disk sort benchmark (one million 100-byte records), Chris Nyberg et al. presented several new performance records at SIGMOD '94 using DEC Alpha processor-based systems. We have implemented the Datamation sort benchmark using our new sort algorithm on a desktop IBM RS/6000 model 39H (66.6 MHz) with 8 IBM SSA 7133 disk drives (total cost $73K). The total elapsed time for the 100 MB sort was 5.1 seconds (vs. the old uniprocessor record of 9.1 seconds). We have also established a new price/performance record (0.2¢ vs. the old record of 0.9¢ as the cost of the sort).

The entire sort processing was overlapped with I/O. During the read phase we achieved a sustained bandwidth of 47 MB/sec, and during the write phase a sustained bandwidth of 39 MB/sec. Key extraction and sorting of one million 10-byte keys took only 0.6 seconds of CPU time. The rest of the CPU time was used in moving records, servicing I/O, and other overheads.

Algorithmic details leading to this level of performance are described in this paper. A detailed analysis of the CPU time spent during the various phases of the sort algorithm and on I/O is also provided.
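
The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope check, not part of the paper: the function names are hypothetical, and the 5-year proration of system cost is an assumption about how the Datamation cost-per-sort metric is computed, since the abstract itself does not state the formula.

```python
# Figures taken from the abstract.
DATA_MB = 100.0            # one million 100-byte records
READ_MB_PER_S = 47.0       # sustained read bandwidth
WRITE_MB_PER_S = 39.0      # sustained write bandwidth
ELAPSED_S = 5.1            # total elapsed time of the sort
SYSTEM_COST_USD = 73_000.0

# ASSUMPTION (not stated in the abstract): system cost is prorated
# over a 5-year lifetime to obtain the cost of one sort.
DEPRECIATION_YEARS = 5
SECONDS_PER_YEAR = 365 * 24 * 3600


def io_time_s(data_mb: float, read_rate: float, write_rate: float) -> float:
    """Time to read the input once and write the output once."""
    return data_mb / read_rate + data_mb / write_rate


def cost_per_sort_cents(system_cost_usd: float, elapsed_s: float,
                        years: int = DEPRECIATION_YEARS) -> float:
    """System cost prorated over the elapsed time of one sort, in cents."""
    lifetime_s = years * SECONDS_PER_YEAR
    return 100.0 * system_cost_usd * elapsed_s / lifetime_s


io_s = io_time_s(DATA_MB, READ_MB_PER_S, WRITE_MB_PER_S)
cents = cost_per_sort_cents(SYSTEM_COST_USD, ELAPSED_S)

print(f"I/O time at sustained rates: {io_s:.2f} s of {ELAPSED_S} s elapsed")
print(f"Cost per sort (5-year proration assumed): {cents:.2f} cents")
```

At the quoted bandwidths, reading and writing 100 MB takes roughly 4.7 of the 5.1 seconds elapsed, which fits the statement that sort processing was overlapped with I/O. Under the assumed 5-year proration, the cost comes to about 0.24¢, in line with the reported 0.2¢.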