Communication overhead is one of the dominant factors affecting performance in high-end computing systems. To reduce its negative impact, programmers overlap communication and computation using asynchronous communication primitives. This, however, increases code complexity, requiring more development effort and making programs less readable. This paper presents the hybrid use of MPI and SMPSs (SMP superscalar, a task-based shared-memory programming model), which allows the programmer to easily introduce the asynchrony necessary to overlap communication and computation. We also describe implementation issues in the SMPSs runtime that support its efficient interoperation with MPI. We demonstrate the hybrid MPI/SMPSs approach on four application kernels (matrix multiply, Jacobi, conjugate gradient, and NAS BT) and on the high-performance LINPACK benchmark. For the application kernels, the hybrid MPI/SMPSs versions significantly outperform their pure MPI counterparts. For LINPACK, the hybrid version approaches asymptotic performance at relatively small problem sizes and still yields significant benefits at large ones. In addition, the hybrid MPI/SMPSs approach substantially reduces code complexity and is less sensitive to network bandwidth and operating-system noise than the pure MPI versions.
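The overlap pattern the abstract refers to can be illustrated without an MPI installation. The following minimal sketch (not from the paper) uses a Python background thread as a stand-in for the network transfer that an asynchronous primitive such as `MPI_Isend` would initiate, with `Future.result()` playing the role of `MPI_Wait`; the `send` and `compute` functions and their inputs are hypothetical.

```python
# Sketch of the communication/computation overlap pattern.
# A worker thread stands in for an in-flight asynchronous transfer.
from concurrent.futures import ThreadPoolExecutor
import time

def send(data):
    """Stand-in for an asynchronous send: pretend the transfer takes time."""
    time.sleep(0.1)
    return len(data)

def compute(n):
    """Independent computation that can proceed while the 'send' is in flight."""
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=1) as pool:
    request = pool.submit(send, b"halo region")  # like MPI_Isend: returns immediately
    result = compute(10_000)                     # overlap: compute while the send runs
    sent = request.result()                      # like MPI_Wait: block until complete
```

Writing this pattern by hand throughout a large code is what drives up complexity; the paper's point is that SMPSs tasking lets the runtime discover and exploit such overlap from data dependences instead.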