Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

Authors:
Timothy Furtak;José Nelson Amaral;Robert Niewiadomski
Affiliations:
University of Alberta, Edmonton, AB, Canada;University of Alberta, Edmonton, AB, Canada;University of Alberta, Edmonton, AB, Canada
Venue:
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Year:
2007

Citing 16
Cited 7

A taxonomy of parallel sorting
ACM Computing Surveys (CSUR)

An empirical comparison of priority-queue and event-set implementations
Communications of the ACM

The influence of caches on the performance of heaps
Journal of Experimental Algorithmics (JEA)

Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming

A fast Fourier transform compiler
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation

The influence of caches on the performance of sorting
SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms

SPL: a language and compiler for DSP algorithms
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation

Designing a PC Game Engine
IEEE Computer Graphics and Applications

The universality of various types of SIMD machine interconnection networks
ISCA '77 Proceedings of the 4th annual symposium on Computer architecture

Efficient sorting using registers and caches
Journal of Experimental Algorithmics (JEA)

A Dynamically Tuned Sorting Library
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization

Adaptive Data Partition for Sorting Using Probability Distribution
ICPP '04 Proceedings of the 2004 International Conference on Parallel Processing

Optimizing Sorting with Genetic Algorithms
Proceedings of the international symposium on Code generation and optimization

Register saturation in instruction level parallelism
International Journal of Parallel Programming

Optimizing data permutations for SIMD devices
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation

Auto-vectorization of interleaved data for SIMD
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation

Shared Register File Based ILP for Multicore
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing

Sorting networks on FPGAs
The VLDB Journal — The International Journal on Very Large Data Bases

A high-performance sorting algorithm for multicore single-instruction multiple-data processors
Software—Practice & Experience

Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays
EGSR'08 Proceedings of the Nineteenth Eurographics conference on Rendering

Highly Parallelable Bidimensional Median Filter for Modern Parallel Programming Models
Journal of Signal Processing Systems

Register level sort algorithm on multi-core SIMD processors
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
Proceedings of Workshop on General Purpose Processing Using GPUs

Abstract

Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, the data is loaded into the vector registers, manipulated in-register, and the result is stored back to memory. Three implementations of sorting with two different SIMD machineries (x86-64's SSE2 and G5's AltiVec) demonstrate that this idea delivers significant speed improvements. The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately sized arrays, with greater relative reductions for small arrays. Wall-clock performance of d-heaps is improved by up to 39% using a similar technique.
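
The load, manipulate in-register, store pattern described in the abstract can be illustrated with a small sketch. This is not the paper's code: the function name `sort4x4_in_register`, the block size of 16 floats, and the choice of a 5-comparator sorting network followed by a transpose are all assumptions made for illustration. On x86 targets with SSE2 it uses the standard `_mm_min_ps` / `_mm_max_ps` intrinsics; elsewhere it falls back to an equivalent scalar loop. NaN inputs are assumed not to occur.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 header; also provides the SSE float intrinsics */
#endif

/* Hypothetical helper (not from the paper): treats a[0..15] as a 4x4 block.
 * Afterwards each group a[0..3], a[4..7], a[8..11], a[12..15] is ascending.
 * Group c holds the original elements a[c], a[4+c], a[8+c], a[12+c]. */
static void sort4x4_in_register(float *a)
{
    assert(a != NULL);
#if defined(__SSE2__)
    /* Load: four vector registers, four floats each. */
    __m128 r0 = _mm_loadu_ps(a + 0);
    __m128 r1 = _mm_loadu_ps(a + 4);
    __m128 r2 = _mm_loadu_ps(a + 8);
    __m128 r3 = _mm_loadu_ps(a + 12);
    __m128 t;

    /* Manipulate in-register: a branch-free compare-exchange on whole
     * registers sorts the four lanes (columns) independently. */
#define CMPSWAP(x, y) \
    do { t = _mm_min_ps(x, y); y = _mm_max_ps(x, y); x = t; } while (0)
    CMPSWAP(r0, r1);
    CMPSWAP(r2, r3);
    CMPSWAP(r0, r2);
    CMPSWAP(r1, r3);
    CMPSWAP(r1, r2);
#undef CMPSWAP

    /* Turn sorted columns into sorted rows, then store back to memory. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(a + 0, r0);
    _mm_storeu_ps(a + 4, r1);
    _mm_storeu_ps(a + 8, r2);
    _mm_storeu_ps(a + 12, r3);
#else
    /* Scalar fallback with the same comparator network, lane by lane. */
    static const int net[5][2] = {{0, 1}, {2, 3}, {0, 2}, {1, 3}, {1, 2}};
    float r[4][4];
    for (int i = 0; i < 4; ++i)
        for (int c = 0; c < 4; ++c)
            r[i][c] = a[4 * i + c];
    for (int k = 0; k < 5; ++k) {
        int x = net[k][0], y = net[k][1];
        for (int c = 0; c < 4; ++c) {
            if (r[x][c] > r[y][c]) {
                float t = r[x][c];
                r[x][c] = r[y][c];
                r[y][c] = t;
            }
        }
    }
    for (int c = 0; c < 4; ++c)
        for (int i = 0; i < 4; ++i)
            a[4 * c + i] = r[i][c];
#endif
}
```

In a recursive sort, a routine like this would be called once a subproblem shrinks to the chosen threshold, and the four sorted runs of length 4 would then be merged by ordinary code. The threshold, element type, and merge step used in the paper are not stated in the abstract, so those remain open here.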