Optimized Pipelined Parallel Merge Sort on the Cell BE

Authors:
Jörg Keller;Christoph W. Kessler
Affiliations:
Dept. of Math. and Computer Science, FernUniversität in Hagen, Hagen, Germany 58084;Dept. of Computer and Inf. Science, Linköpings Universitet, Linköping, Sweden 58183
Venue:
Euro-Par 2008 Workshops - Parallel Processing
Year:
2009

Abstract

Chip multiprocessors designed for streaming applications, such as the Cell BE, offer impressive peak performance but suffer from limited bandwidth to off-chip main memory. As the number of cores is expected to rise further, this bottleneck will become more critical in the coming years. Hence, memory-efficient algorithms are required. As a case study, we investigate parallel sorting on the Cell BE, a problem of great importance and a challenge because the ratio between computation and memory transfer is very low. Our previous work led to a parallel mergesort that reduces memory bandwidth requirements by pipelining between SPEs, but the allocation of SPEs was rather ad hoc. In our present work, we investigate mappings of merger nodes to SPEs. The mappings are designed to provide optimal trade-offs between load balancing, buffer memory consumption, and communication load on the on-chip bus. We solve this multi-objective optimization problem by deriving an integer linear programming formulation, and we compute Pareto-optimal solutions for the mapping of merge trees with up to 127 merger nodes. For mapping larger trees, we give a fast divide-and-conquer-based approximation algorithm. We evaluate the sorting algorithm resulting from our mappings by a discrete event simulation.
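To make the mapping problem concrete, the following is a minimal sketch and not the authors' method: it uses neither the integer linear program nor the divide-and-conquer approximation. It swaps in a plain greedy heuristic (heaviest merger first, onto the least-loaded SPE). It assumes a complete binary merge tree in which a merger at depth d handles a 2^-d share of the output stream, so a tree with 127 mergers has 7 levels and a total load of 7. The function names, the heap-style node numbering, and the choice of 8 SPEs in the usage note are illustrative assumptions. Buffer memory consumption is ignored, and communication is only counted as tree edges that cross SPE boundaries.

```python
import heapq


def node_load(node):
    """Load of a merger in heap-style numbering (root = 1, children 2v and 2v+1).

    Assumption: a merger at depth d processes a 2**-d share of the stream.
    """
    depth = node.bit_length() - 1
    return 2.0 ** -depth


def greedy_map(levels, num_spes):
    """Hypothetical greedy mapping: heaviest merger first, to the least-loaded SPE.

    Returns a dict {merger node -> SPE index}. This is an illustrative
    heuristic, not the ILP or approximation algorithm from the paper.
    """
    heap = [(0.0, spe) for spe in range(num_spes)]
    heapq.heapify(heap)
    mapping = {}
    # Heap-style numbering already lists nodes by non-increasing load.
    for node in range(1, 2 ** levels):
        spe_load, spe = heapq.heappop(heap)
        mapping[node] = spe
        heapq.heappush(heap, (spe_load + node_load(node), spe))
    return mapping


def max_load(mapping, num_spes):
    """Largest total merger load placed on any single SPE."""
    totals = [0.0] * num_spes
    for node, spe in mapping.items():
        totals[spe] += node_load(node)
    return max(totals)


def cross_edges(mapping):
    """Number of parent-child tree edges whose endpoints sit on different SPEs."""
    return sum(1 for v in mapping if v > 1 and mapping[v] != mapping[v // 2])
```

For example, `greedy_map(7, 8)` places all 127 mergers on 8 SPEs; comparing `max_load` against the ideal value 7/8 and inspecting `cross_edges` shows how balancing load alone tends to scatter neighbouring mergers across SPEs, which is the tension between load balance and on-chip communication that the abstract describes.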