BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power, dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives, which turns out to be suboptimal on this machine for a number of reasons. Machine-optimized MPI collectives are therefore necessary to harness the performance of BlueGene/L. We discuss these optimized MPI collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
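To make "collectives based on point-to-point communication primitives" concrete, the following is a minimal sketch of how a broadcast can be layered on individual sends. It uses a binomial tree, one common textbook scheme; the abstract does not say which algorithms MPICH2 or the optimized library use, so the function name `binomial_broadcast_schedule` and the schedule format are assumptions made purely for illustration. The code only computes who sends to whom in each round and does not call MPI.

```python
def binomial_broadcast_schedule(num_ranks, root=0):
    """Return the rounds of a binomial-tree broadcast.

    Hypothetical helper for illustration only (not taken from the paper).
    Each round is a list of (sender, receiver) point-to-point sends.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    rounds = []
    have_data = 1  # relative ranks 0 .. have_data-1 already hold the message
    while have_data < num_ranks:
        sends = []
        for rel in range(have_data):
            dst = rel + have_data
            if dst < num_ranks:
                # Map relative ranks back to real ranks around the root.
                sends.append(((rel + root) % num_ranks,
                              (dst + root) % num_ranks))
        rounds.append(sends)
        have_data *= 2  # the set of ranks holding the data doubles each round
    return rounds


if __name__ == "__main__":
    schedule = binomial_broadcast_schedule(4096)
    print("rounds:", len(schedule))
    print("point-to-point sends:", sum(len(r) for r in schedule))
```

In this sketch every round is an ordinary software send between two nodes, and a round cannot start before the previous one has delivered its data. A schedule of this kind takes no account of the torus topology or the dedicated collective network mentioned above, which is consistent with the motivation for machine-optimized collectives, although the abstract itself does not list the specific reasons.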