In this paper we describe the design and implementation of CellSort, a high-performance distributed sort algorithm for the Cell processor. We design CellSort as a distributed bitonic merge with a data-parallel bitonic sorting kernel. To best exploit the architecture of the Cell processor and to use all available forms of parallelism for good scalability, we structure CellSort as a three-tiered sort. The first tier is a SIMD (single-instruction, multiple-data) optimized bitonic sort, which sorts up to 128KB of items that can fit into the local store of one SPE (a co-processor on Cell). We design a comprehensive SIMDization scheme that employs data parallelism even for the most fine-grained steps of the bitonic sorting kernel. Our results show that the SIMDized bitonic sorting kernel is vastly superior to other alternatives on the SPE and runs up to 1.7 times faster than quick sort on a 3.2GHz Intel Xeon. The second tier is an in-core bitonic merge optimized for cross-SPE data transfers via asynchronous DMAs; it sorts as many items as can fit into the cumulative space available on the local stores of the participating SPEs. We design data transfer and synchronization patterns that minimize serial sections of the code by taking advantage of the high aggregate cross-SPE bandwidth available on Cell. Results show that the in-core bitonic sort scales well on the Cell processor as the number of SPEs increases, and runs up to 10 times faster with 16 SPEs than parallel quick sort on a dual 3.2GHz Intel Xeon. The third tier is an out-of-core bitonic merge, which sorts a large number of items stored in main memory. Results show that, when properly implemented, distributed out-of-core bitonic sort on Cell can significantly outperform the asymptotically (average case) superior quick sort for a large number of memory-resident items (up to 4 times faster when sorting 0.5GB of data with 16 SPEs, compared to a dual 3.2GHz Intel Xeon).
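For readers unfamiliar with the bitonic sorting kernel that all three tiers build on, the following is a minimal sequential sketch of a textbook bitonic sorting network in Python. It is not the CellSort implementation: the function name `bitonic_sort` is hypothetical, the input length is assumed to be a power of two, and no SIMD instructions, DMA transfers, or SPE-level parallelism are modeled. What it does show is that the sequence of compare-exchange pairs depends only on the indices and never on the data, which is the property that makes a bitonic network a natural candidate for data-parallel execution.

```python
def bitonic_sort(items):
    """Return a sorted copy of `items` using a textbook bitonic sorting network.

    Illustrative sketch only: assumes len(items) is a power of two (or zero)
    and runs sequentially.
    """
    a = list(items)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2
    while k <= n:              # k: length of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # j: distance between compare-exchange partners
            for i in range(n):
                partner = i ^ j
                if partner > i:                   # handle each pair exactly once
                    ascending = (i & k) == 0      # direction of this k-block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

For example, `bitonic_sort([7, 3, 6, 0, 5, 1, 4, 2])` returns `[0, 1, 2, 3, 4, 5, 6, 7]`. Every iteration of the inner `for` loop for a fixed `(k, j)` touches a disjoint pair of positions, so those iterations are independent of one another; the sketch runs them one at a time only for clarity.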