Register level sort algorithm on multi-core SIMD processors

Authors:
Tian Xiaochen;Kamil Rocki;Reiji Suda
Affiliations:
The University of Tokyo & CREST, JST;The University of Tokyo & CREST, JST;The University of Tokyo & CREST, JST
Venue:
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
Year:
2013



Abstract

State-of-the-art hardware increasingly relies on SIMD parallelism, in which multiple processing elements execute the same instruction on multiple data points simultaneously. Irregular and data-intensive algorithms, however, are not well suited to such architectures, yet they are important enough that efficient implementations are crucial. One example is sorting, a fundamental problem in computer science. In this paper we analyze distinct memory access models and propose two methods that use a highly efficient bitonic merge sort, implemented with SIMD instructions, as a register-level sort. We achieve a nearly 270x speedup (525M integers/s) on a set of 4M integers using a Xeon Phi coprocessor, with SIMD-level parallelism accelerating the algorithm by more than 3 times. Our method can be applied to any device that supports similar SIMD instructions.
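
To illustrate why bitonic merge sort maps well onto SIMD hardware, the following is a minimal sketch and not the authors' implementation. It assumes a power-of-two input and uses plain Python lists as stand-ins for SIMD registers; the function name `bitonic_sort` is hypothetical. Each stage applies the same lane-wise min/max to every element, with a partner pattern that depends only on the lane index and never on the data, which is the property that lets real SIMD min/max and shuffle instructions execute a whole stage at once.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence with a bitonic sorting network.

    Illustrative sketch: Python lists model SIMD registers, and each
    list comprehension models one lane-wise vector operation.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")

    a = list(values)
    k = 2  # size of the bitonic sequences being merged in this phase
    while k <= n:
        j = k // 2  # distance between compare-exchange partners
        while j >= 1:
            # "Shuffle": lane i reads its partner lane i ^ j.
            partner = [a[i ^ j] for i in range(n)]
            # Lane-wise min and max, identical work in every lane.
            lo = [min(x, y) for x, y in zip(a, partner)]
            hi = [max(x, y) for x, y in zip(a, partner)]
            # "Blend": a fixed mask picks min or max per lane.
            # The lower lane of a pair keeps the min in ascending
            # blocks ((i & k) == 0) and the max in descending blocks.
            a = [
                lo[i] if ((i & j) == 0) == ((i & k) == 0) else hi[i]
                for i in range(n)
            ]
            j //= 2
        k *= 2
    return a


if __name__ == "__main__":
    print(bitonic_sort([5, 2, 7, 1, 8, 3, 6, 4]))
```

The network performs the same sequence of comparisons for any input, so there are no data-dependent branches; in a register-level implementation the partner reads and the min/max selection would be expressed with shuffle, min/max, and blend instructions on whatever vector width the target device provides.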