SMP clusters and multiclusters are widely used to execute message-passing parallel applications. The way parallel processes are mapped to processors (or cores) can affect application performance significantly, because communication costs in such systems are non-uniform. A tool that maps parallel processes to processors (or cores) automatically is therefore desirable. Although there have been various efforts to address this issue, existing solutions either require intensive user intervention or cannot handle multiclusters well. In this paper, we propose a profile-guided approach that automatically finds an optimized mapping to minimize the cost of point-to-point communications for arbitrary message-passing applications. The implemented toolset, called MPIPP (MPI Process Placement toolset), includes several components: 1) a tool to obtain the communication profile of MPI applications; 2) a tool to obtain the network topology of target clusters; and 3) an algorithm to find an optimized mapping, which is more effective than existing graph-partitioning algorithms, especially for multiclusters. We evaluated our tool with the NPB benchmarks and three other applications on several clusters. Experimental results show that the optimized process placement generated by our tools achieves significant speedup.
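To make the optimization problem concrete, the following is a minimal sketch (not MPIPP's actual heuristic) of what "finding an optimized mapping" means: given a communication matrix between processes and a cost matrix between cores, choose a process-to-core assignment minimizing the total weighted communication cost. The matrices, names, and the brute-force search below are illustrative assumptions; MPIPP uses a heuristic rather than exhaustive search.

```python
import itertools

def mapping_cost(mapping, comm, netcost):
    # Total cost: message volume between each process pair, weighted by
    # the network cost between the cores those processes are placed on.
    n = len(mapping)
    return sum(comm[i][j] * netcost[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n) if i != j)

def best_mapping(comm, netcost):
    # Exhaustive search over all assignments: feasible only for tiny n,
    # shown purely to define the objective that a real tool optimizes.
    n = len(comm)
    return min(itertools.permutations(range(n)),
               key=lambda m: mapping_cost(m, comm, netcost))

# Hypothetical example: processes (0,1) and (2,3) communicate heavily.
comm = [[0, 10, 1, 1],
        [10, 0, 1, 1],
        [1, 1, 0, 10],
        [1, 1, 10, 0]]
# Cores 0,1 share a node (cheap link); cores 2,3 share another node;
# inter-node communication is costly.
netcost = [[0, 1, 5, 5],
           [1, 0, 5, 5],
           [5, 5, 0, 1],
           [5, 5, 1, 0]]

m = best_mapping(comm, netcost)
```

A good mapping places each heavily-communicating pair on the same node, so the expensive inter-node links carry only light traffic.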