The Sony-Toshiba-IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that integrates a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) on a single chip. Although the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies that illustrate the design and implementation of parallel combinatorial algorithms on the Cell/B.E.: list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is particularly challenging to parallelize on current cache-based and distributed-memory architectures because of its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose the work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to implement list ranking efficiently, and demonstrate substantial speedup over traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is used extensively in both scientific and general-purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Huffman data encoding. We design efficient parallel algorithms for these combinatorial kernels and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E.-optimized implementation of gzip, a popular file-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations.
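To make the list ranking kernel concrete, the following minimal Python sketch contrasts a sequential traversal with textbook pointer jumping. It is an illustration only, not the SM-Threads implementation described in the abstract. The successor-array representation, the `-1` tail marker, and the function names `rank_sequential` and `rank_pointer_jumping` are assumptions made for this example.

```python
def rank_sequential(succ, head):
    """Rank each node by its distance from the head, walking the list once.

    succ[i] is the index of the node after i, or -1 at the tail
    (an assumed representation for this sketch).
    """
    rank = [0] * len(succ)
    node, r = head, 0
    while node != -1:
        rank[node] = r
        r += 1
        node = succ[node]
    return rank


def rank_pointer_jumping(succ):
    """Compute each node's distance to the tail by pointer jumping.

    Every round doubles the span covered by each pointer, so about
    log2(n) rounds suffice. The inner loop reads only the previous
    round's arrays, so its iterations are independent and could run
    in parallel.
    """
    n = len(succ)
    nxt = list(succ)
    dist = [0 if s == -1 else 1 for s in succ]
    while any(s != -1 for s in nxt):
        new_dist = dist[:]
        new_nxt = nxt[:]
        for i in range(n):
            if nxt[i] != -1:
                new_dist[i] = dist[i] + dist[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        dist, nxt = new_dist, new_nxt
    return dist


# List order 0 -> 3 -> 2 -> 1, stored at scattered array positions.
succ = [3, -1, 1, 2]
from_head = rank_sequential(succ, head=0)
to_tail = rank_pointer_jumping(succ)
```

Both functions follow `succ` to whatever index it holds, so consecutive accesses land at unrelated positions in the array. This is the irregular memory access pattern, paired with very little arithmetic per access, that the abstract identifies as the difficulty.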