CellSort: high performance sorting on the cell processor

Authors:
Buğra Gedik;Rajesh R. Bordawekar;Philip S. Yu
Affiliations:
Thomas J. Watson Research Center, IBM Research, Hawthorne, NY;Thomas J. Watson Research Center, IBM Research, Hawthorne, NY;Thomas J. Watson Research Center, IBM Research, Hawthorne, NY
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007


Abstract

In this paper we describe the design and implementation of CellSort, a high performance distributed sort algorithm for the Cell processor. We design CellSort as a distributed bitonic merge with a data-parallel bitonic sorting kernel. To best exploit the architecture of the Cell processor and use all available forms of parallelism to achieve good scalability, we structure CellSort as a three-tiered sort. The first tier is a SIMD (single-instruction multiple-data) optimized bitonic sort, which sorts up to 128KB of items that can fit into the local store of one SPE (a co-processor on Cell). We design a comprehensive SIMDization scheme that employs data parallelism even for the most fine-grained steps of the bitonic sorting kernel. Our results show that the SIMDized bitonic sorting kernel is vastly superior to the alternatives on the SPE and performs up to 1.7 times faster than quick sort on a 3.2GHz Intel Xeon. The second tier is an in-core bitonic merge optimized for cross-SPE data transfers via asynchronous DMAs; it sorts as many items as can fit into the cumulative space available on the local stores of the participating SPEs. We design data transfer and synchronization patterns that minimize serial sections of the code by taking advantage of the high aggregate cross-SPE bandwidth available on Cell. Results show that the in-core bitonic sort scales well on the Cell processor with an increasing number of SPEs, and performs up to 10 times faster with 16 SPEs than parallel quick sort on a dual 3.2GHz Intel Xeon. The third tier is an out-of-core bitonic merge, which sorts a large number of items stored in main memory. Results show that, when properly implemented, a distributed out-of-core bitonic sort on Cell can significantly outperform the asymptotically (average case) superior quick sort for a large number of memory-resident items (up to 4 times faster when sorting 0.5GB of data with 16 SPEs, compared to a dual 3.2GHz Intel Xeon).
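
For readers unfamiliar with the bitonic sorting kernel the abstract refers to, the following is a minimal scalar sketch of a textbook bitonic sorting network in Python. It is not the CellSort implementation: it has no SIMD, no DMA transfers, and no tiering, and the function name `bitonic_sort` is chosen here for illustration only. It assumes the input length is a power of two. The point it illustrates is that the sequence of compare-exchange positions depends only on the input length and never on the data values, which is the property that makes bitonic sorting a natural fit for data-parallel execution.

```python
def bitonic_sort(items):
    """Return a sorted copy of `items` using a bitonic sorting network.

    Illustrative sketch only; assumes len(items) is a power of two.
    """
    a = list(items)
    n = len(a)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")

    k = 2  # size of the bitonic sequences merged in this phase
    while k <= n:
        j = k // 2  # distance between compared elements in this step
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:  # visit each pair exactly once
                    # Blocks of size k alternate between ascending and
                    # descending order so that adjacent blocks together
                    # form a bitonic sequence for the next phase.
                    ascending = (i & k) == 0
                    out_of_order = (
                        a[i] > a[partner] if ascending else a[i] < a[partner]
                    )
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Every iteration of the inner `for` loop within one `(k, j)` step touches a disjoint pair of positions, so all compare-exchanges of a step could in principle run in parallel; the cost is O(n log^2 n) comparisons in total, compared with the O(n log n) average case of quick sort mentioned in the abstract.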