A high-performance sorting algorithm for multicore single-instruction multiple-data processors

Authors:
Hiroshi Inoue;Takao Moriyama;Hideaki Komatsu;Toshio Nakatani
Affiliations:
IBM Research, Tokyo, Japan;IBM Research, Tokyo, Japan;IBM Research, Tokyo, Japan;IBM Research, Tokyo, Japan
Venue:
Software—Practice & Experience
Year:
2012

Citing (17):

A Benchmark Parallel Sort for Shared Memory Multiprocessors. IEEE Transactions on Computers.
Introspective sorting and selection algorithms. Software—Practice & Experience.
Implementing database operations using SIMD instructions. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.
Photon mapping on programmable graphics hardware. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware.
Fast and approximate stream mining of quantiles and frequencies using graphics processors. Proceedings of the 2005 ACM SIGMOD international conference on Management of data.
Implementing sorting in database systems. ACM Computing Surveys (CSUR).
GPUTeraSort: high performance graphics co-processor sorting for large database management. Proceedings of the 2006 ACM SIGMOD international conference on Management of data.
Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures.
AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques.
CellSort: high performance sorting on the cell processor. VLDB '07 Proceedings of the 33rd international conference on Very large data bases.
Scalable Parallel Programming with CUDA. Queue - GPU Computing.
Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing.
A Practical Quicksort Algorithm for Graphics Processors. ESA '08 Proceedings of the 16th annual European symposium on Algorithms.
Sorting networks and their applications. AFIPS '68 (Spring) Proceedings of the April 30--May 2, 1968, spring joint computer conference.
Optimized Pipelined Parallel Merge Sort on the Cell BE. Euro-Par 2008 Workshops - Parallel Processing.
Designing efficient sorting algorithms for manycore GPUs. IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing.
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.

Cited (1):

Efficient sorting design on a novel embedded parallel computing architecture with unique memory access. Computers and Electrical Engineering.

Abstract

Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single-instruction multiple-data (SIMD) instructions and thread-level parallelism. In this paper, we propose a new high-performance sorting algorithm, called aligned-access sort (AA-sort), that exploits both the SIMD instructions and thread-level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in-core sorting phase and an out-of-core merging phase. The in-core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out-of-core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA-sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA-sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32-bit integers. Also, a parallel version of AA-sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.
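
The two-phase structure described in the abstract can be illustrated with a minimal scalar sketch in Python. This is not the authors' AA-sort: it uses no SIMD instructions, no threads, and none of the paper's aligned-access techniques. It only shows the general shape of sorting fixed-size blocks with plain combsort and then combining them with an ordinary two-way merge. The function names, the shrink factor of 1.3, and the `block_size` parameter (a stand-in for a block that would fit in cache) are assumptions made for this example.

```python
def combsort(a, shrink=1.3):
    """Sort list `a` in place with plain (scalar) combsort and return it."""
    gap = len(a)
    swapped = True
    # Keep passing over the data until the gap has shrunk to 1
    # and a full pass at gap 1 makes no swaps.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(a) - gap):
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a


def merge(left, right):
    """Merge two sorted lists into one new sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def two_phase_sort(data, block_size=8):
    """Phase 1: combsort each block. Phase 2: merge blocks pairwise.

    `block_size` is a hypothetical parameter for illustration and
    must be at least 1.
    """
    blocks = [
        combsort(list(data[k:k + block_size]))
        for k in range(0, len(data), block_size)
    ]
    while len(blocks) > 1:
        merged = []
        for k in range(0, len(blocks), 2):
            if k + 1 < len(blocks):
                merged.append(merge(blocks[k], blocks[k + 1]))
            else:
                merged.append(blocks[k])  # odd block carries over unchanged
        blocks = merged
    return blocks[0] if blocks else []
```

Calling `two_phase_sort(values, block_size=4)` on a list of integers returns a new sorted list and leaves `values` unchanged. The paper's contribution lies in how each phase is vectorized and kept free of unaligned memory accesses, which this sketch does not attempt to reproduce.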