A taxonomy of parallel sorting
ACM Computing Surveys (CSUR)
An empirical comparison of priority-queue and event-set implementations
Communications of the ACM
The influence of caches on the performance of heaps
Journal of Experimental Algorithmics (JEA)
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
A fast Fourier transform compiler
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
The influence of caches on the performance of sorting
SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
SPL: a language and compiler for DSP algorithms
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
IEEE Computer Graphics and Applications
The universality of various types of SIMD machine interconnection networks
ISCA '77 Proceedings of the 4th annual symposium on Computer architecture
Efficient sorting using registers and caches
Journal of Experimental Algorithmics (JEA)
A Dynamically Tuned Sorting Library
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Adaptive Data Partition for Sorting Using Probability Distribution
ICPP '04 Proceedings of the 2004 International Conference on Parallel Processing
Optimizing Sorting with Genetic Algorithms
Proceedings of the international symposium on Code generation and optimization
Register saturation in instruction level parallelism
International Journal of Parallel Programming
Optimizing data permutations for SIMD devices
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Auto-vectorization of interleaved data for SIMD
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Shared Register File Based ILP for Multicore
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
The VLDB Journal — The International Journal on Very Large Data Bases
A high-performance sorting algorithm for multicore single-instruction multiple-data processors
Software—Practice & Experience
Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays
EGSR'08 Proceedings of the Nineteenth Eurographics conference on Rendering
Highly Parallelable Bidimensional Median Filter for Modern Parallel Programming Models
Journal of Signal Processing Systems
Register level sort algorithm on multi-core SIMD processors
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
Proceedings of Workshop on General Purpose Processing Using GPUs
Most contemporary processors offer some form of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions that manipulate the data stored in those registers. The central idea of this paper is to use these SIMD resources to speed up the tail of recursive sorting algorithms. Once the number of elements left to sort falls to a set threshold, the data is loaded into vector registers, manipulated in-register, and the result is stored back to memory. Three sorting implementations on two different SIMD instruction sets (x86-64's SSE2 and the G5's AltiVec) demonstrate that this idea delivers significant speed improvements. These improvements are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately sized arrays, with greater relative reductions for small arrays. A similar technique improves the wall-clock performance of d-heaps by up to 39%.
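The threshold scheme described in the abstract can be illustrated with a small sketch. This is not the paper's SSE2 or AltiVec code: it is a portable scalar stand-in, assuming a hypothetical cutoff of four elements (`TAIL_THRESHOLD`) and hypothetical function names (`cmpswap`, `sort_tail`, `hybrid_sort`). The tail is loaded into local variables, sorted by a fixed five-comparator network built only from min/max selects, and stored back. In a SIMD version each min/max pair would presumably map to a vector min and a vector max instruction applied to whole registers.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical cutoff: at or below this size the recursion stops
   and the fixed-size network takes over. */
#define TAIL_THRESHOLD 4

/* Compare-exchange written as a min/max pair, so that it mirrors
   the min/max instructions a SIMD version would use. */
static void cmpswap(int *lo, int *hi) {
    int a = *lo, b = *hi;
    *lo = a < b ? a : b;
    *hi = a < b ? b : a;
}

/* Sort n <= TAIL_THRESHOLD values: load, sort "in-register", store.
   Unused slots are padded with INT_MAX so one network serves all n. */
static void sort_tail(int *p, size_t n) {
    int r[TAIL_THRESHOLD] = { INT_MAX, INT_MAX, INT_MAX, INT_MAX };
    for (size_t i = 0; i < n; i++) r[i] = p[i];      /* load  */

    /* Five-comparator sorting network for four inputs. */
    cmpswap(&r[0], &r[1]); cmpswap(&r[2], &r[3]);
    cmpswap(&r[0], &r[2]); cmpswap(&r[1], &r[3]);
    cmpswap(&r[1], &r[2]);

    for (size_t i = 0; i < n; i++) p[i] = r[i];      /* store */
}

/* Quicksort that hands small subarrays to sort_tail. */
void hybrid_sort(int *a, size_t n) {
    assert(a != NULL || n == 0);
    while (n > TAIL_THRESHOLD) {
        /* Lomuto partition around the middle element. */
        size_t mid = n / 2;
        int pivot = a[mid];
        a[mid] = a[n - 1];
        a[n - 1] = pivot;

        size_t store = 0;
        for (size_t i = 0; i + 1 < n; i++) {
            if (a[i] < pivot) {
                int t = a[i]; a[i] = a[store]; a[store] = t;
                store++;
            }
        }
        a[n - 1] = a[store];
        a[store] = pivot;

        /* Recurse on the left part, iterate on the right part. */
        hybrid_sort(a, store);
        a += store + 1;
        n -= store + 1;
    }
    sort_tail(a, n);
}
```

Calling `hybrid_sort(data, count)` on an `int` array sorts it in ascending order. The cutoff of four is chosen only to keep the network short; the abstract does not state the threshold the authors used.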