Communication overhead is one of the dominant factors affecting performance in high-end computing systems. To reduce its negative impact, programmers overlap communication and computation using asynchronous communication primitives. This, however, increases code complexity, requiring more development effort and making programs less readable. This paper presents the hybrid use of MPI and SMPSs (SMP superscalar, a task-based shared-memory programming model), which allows the programmer to easily introduce the asynchrony necessary to overlap communication and computation. We also describe implementation issues in the SMPSs runtime that support its efficient interoperation with MPI. We demonstrate the hybrid MPI/SMPSs approach on four application kernels (matrix multiply, Jacobi, conjugate gradient, and NAS BT) and on the high-performance LINPACK benchmark. For the application kernels, the hybrid MPI/SMPSs versions significantly outperform their pure MPI counterparts. For LINPACK, the hybrid version approaches asymptotic performance at relatively small problem sizes and still yields significant benefits at large ones. In addition, the hybrid MPI/SMPSs approach substantially reduces code complexity and is less sensitive to network bandwidth and operating-system noise than the pure MPI versions.
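The overlap pattern the abstract refers to can be illustrated without an MPI installation. The following minimal sketch (not from the paper) uses a Python background thread as a stand-in for the network transfer that an asynchronous primitive such as `MPI_Isend` would initiate, with `Future.result()` playing the role of `MPI_Wait`; the `send` and `compute` functions and their inputs are hypothetical.

```python
# Sketch of the communication/computation overlap pattern.
# A worker thread stands in for an in-flight asynchronous transfer.
from concurrent.futures import ThreadPoolExecutor
import time

def send(data):
    """Stand-in for an asynchronous send: pretend the transfer takes time."""
    time.sleep(0.1)
    return len(data)

def compute(n):
    """Independent computation that can proceed while the 'send' is in flight."""
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=1) as pool:
    request = pool.submit(send, b"halo region")  # like MPI_Isend: returns immediately
    result = compute(10_000)                     # overlap: compute while the send runs
    sent = request.result()                      # like MPI_Wait: block until complete
```

Writing this pattern by hand throughout a large code is what drives up complexity; the paper's point is that SMPSs tasking lets the runtime discover and exploit such overlap from data dependences instead.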