SMP clusters and multiclusters are widely used to execute message-passing parallel applications. The way parallel processes are mapped to processors (or cores) can affect application performance significantly, because communication costs in such systems are non-uniform. A tool that maps parallel processes to processors (or cores) automatically is therefore desirable. Although there have been various efforts to address this issue, existing solutions either require intensive user intervention or cannot handle multiclusters well. In this paper, we propose a profile-guided approach that automatically finds an optimized mapping to minimize the cost of point-to-point communications for arbitrary message-passing applications. The implemented toolset, called MPIPP (MPI Process Placement toolset), includes several components: 1) a tool to obtain the communication profile of MPI applications; 2) a tool to obtain the network topology of target clusters; and 3) an algorithm to find an optimized mapping, which is more effective than existing graph-partitioning algorithms, especially for multiclusters. We evaluated our tool with the NPB benchmarks and three other applications on several clusters. Experimental results show that the optimized process placement generated by our tools achieves significant speedup.
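To make the optimization problem concrete, the following is a minimal sketch (not MPIPP's actual heuristic) of what "finding an optimized mapping" means: given a communication matrix between processes and a cost matrix between cores, choose a process-to-core assignment minimizing the total weighted communication cost. The matrices, names, and the brute-force search below are illustrative assumptions; MPIPP uses a heuristic rather than exhaustive search.

```python
import itertools

def mapping_cost(mapping, comm, netcost):
    # Total cost: message volume between each process pair, weighted by
    # the network cost between the cores those processes are placed on.
    n = len(mapping)
    return sum(comm[i][j] * netcost[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n) if i != j)

def best_mapping(comm, netcost):
    # Exhaustive search over all assignments: feasible only for tiny n,
    # shown purely to define the objective that a real tool optimizes.
    n = len(comm)
    return min(itertools.permutations(range(n)),
               key=lambda m: mapping_cost(m, comm, netcost))

# Hypothetical example: processes (0,1) and (2,3) communicate heavily.
comm = [[0, 10, 1, 1],
        [10, 0, 1, 1],
        [1, 1, 0, 10],
        [1, 1, 10, 0]]
# Cores 0,1 share a node (cheap link); cores 2,3 share another node;
# inter-node communication is costly.
netcost = [[0, 1, 5, 5],
           [1, 0, 5, 5],
           [5, 5, 0, 1],
           [5, 5, 1, 0]]

m = best_mapping(comm, netcost)
```

A good mapping places each heavily-communicating pair on the same node, so the expensive inter-node links carry only light traffic.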