The memory hierarchy on multi-core clusters has a twofold character: a vertical memory hierarchy and a horizontal memory hierarchy. This paper proposes a new parallel computation model that uniformly abstracts the memory hierarchy of multi-core clusters at both the vertical and horizontal levels. Experimental results show that the new model predicts communication costs for message passing on multi-core clusters more accurately than previous models, which incorporate only the vertical memory hierarchy. The new model thus provides a theoretical underpinning for the optimal design of MPI collective operations. Targeting the horizontal memory hierarchy, our methodology for optimizing collective operations on multi-core clusters centers on a hierarchical virtual topology and cache-aware intra-node communication, which we incorporate into the existing collective algorithms in MPICH2. As a case study, a multi-core-aware broadcast algorithm has been implemented and evaluated. The performance evaluation shows that this optimization methodology for collective operations on multi-core clusters is effective.
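The hierarchical-virtual-topology idea behind the multi-core-aware broadcast can be illustrated with a small sketch. The following Python function is not the paper's MPICH2 implementation; it is a hypothetical schedule generator, assuming the common two-phase scheme for such broadcasts: the lowest rank on each node (or the root itself, on the root's node) acts as a leader, leaders exchange the message over the network in a binomial tree, and each leader then fans the message out to its node-local peers through shared memory. All names (`broadcast_schedule`, the `"inter"`/`"intra"` phase tags) are illustrative.

```python
def broadcast_schedule(nprocs, cores_per_node, root=0):
    """Return a list of (sender, receiver, phase) steps for a
    hypothetical two-phase, multi-core-aware broadcast:
      phase "inter": binomial tree among node leaders (network),
      phase "intra": leader fan-out within each node (shared memory)."""
    # Group ranks by node, assuming a block mapping of ranks to nodes.
    nodes = {}
    for r in range(nprocs):
        nodes.setdefault(r // cores_per_node, []).append(r)

    # Elect leaders: the root leads its own node, lowest rank elsewhere.
    leaders = [root if root in ranks else ranks[0]
               for _, ranks in sorted(nodes.items())]

    # Phase 1: binomial tree over the leaders, rooted at the root's leader.
    # Each round, every leader that already holds the message forwards it
    # to one leader that does not, so the holder count doubles per round.
    order = sorted(leaders, key=lambda l: l != root)  # root's leader first
    steps, have, pending = [], [order[0]], order[1:]
    while pending:
        for s in list(have):
            if not pending:
                break
            r = pending.pop(0)
            steps.append((s, r, "inter"))
            have.append(r)

    # Phase 2: each leader delivers to its node-local peers; these hops
    # stay inside one node, where cache-aware shared-memory copies apply.
    for _, ranks in sorted(nodes.items()):
        lead = root if root in ranks else ranks[0]
        steps.extend((lead, r, "intra") for r in ranks if r != lead)
    return steps
```

For example, with 8 ranks, 4 cores per node, and root 0, the schedule has one inter-node step (0 → 4) followed by six intra-node fan-out steps, so only a single message crosses the network instead of up to seven in a topology-oblivious binomial broadcast.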