Multi-core processors represent a paradigm shift in computer architecture that promises a dramatic increase in performance, but they also bring an unprecedented level of complexity to algorithm design and software development. In this paper we describe the challenges involved in designing a Breadth-First Search (BFS) algorithm for the Cell/B.E. processor. The proposed methodology combines a high-level algorithmic design, which captures the machine-independent aspects and so keeps the algorithm portable and performant on future processors, with an implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived from the Bulk-Synchronous Parallel (BSP) model, we have determined an accurate performance model that has guided the implementation and optimization of our algorithm. Our experiments on a pre-production Cell/B.E. board running at 3.2 GHz show almost linear speedups when using multiple synergistic processing elements, and strong performance relative to other processors. On graphs that offer sufficient parallelism, the Cell/B.E. is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, and than custom-designed architectures, such as the MTA-2 and BlueGene/L.
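To make the BSP-style coordination concrete, the following is a minimal sequential sketch of a level-synchronous BFS in Python. It is not the paper's Cell/B.E. implementation: the function name `bsp_bfs`, the `num_workers` parameter, and the round-robin partitioning of the frontier are assumptions made for illustration, and the "workers" are simulated in a single thread. Each loop iteration plays the role of one superstep: a local expansion phase, followed by a merge that stands in for the barrier and data exchange.

```python
def bsp_bfs(adj, source, num_workers=4):
    """Level-synchronous BFS, simulated sequentially (illustrative sketch).

    adj: dict mapping each vertex to a list of its neighbours.
    Returns a dict mapping every reachable vertex to its BFS level.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        # Assumption: the frontier is split round-robin across workers.
        chunks = [frontier[i::num_workers] for i in range(num_workers)]

        # Local computation phase: each worker gathers the neighbours
        # of the frontier vertices it owns, without touching shared state.
        candidates = []
        for chunk in chunks:
            local = []
            for v in chunk:
                local.extend(adj.get(v, []))
            candidates.append(local)

        # Synchronization phase: merge the per-worker results, discard
        # vertices already visited, and build the next frontier.
        depth += 1
        next_frontier = []
        for local in candidates:
            for w in local:
                if w not in level:
                    level[w] = depth
                    next_frontier.append(w)
        frontier = next_frontier
    return level


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: [], 5: [0]}
    print(bsp_bfs(graph, 0))
```

Separating the expansion phase from the merge phase is what makes the cost of each superstep easy to reason about, which is the property a BSP-derived performance model relies on. A real multi-core version would run the expansion phase concurrently and replace the merge with an explicit exchange and barrier.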