The memory hierarchy on multi-core clusters has a twofold character: a vertical memory hierarchy and a horizontal memory hierarchy. This paper proposes a new parallel computation model that uniformly abstracts the memory hierarchy of multi-core clusters at both the vertical and horizontal levels. Experimental results show that the new model predicts communication costs for message passing on multi-core clusters more accurately than previous models, which incorporate only the vertical memory hierarchy. The new model thus provides a theoretical underpinning for the optimal design of MPI collective operations. Targeting the horizontal memory hierarchy, our methodology for optimizing collective operations on multi-core clusters centers on a hierarchical virtual topology and cache-aware intra-node communication, which we incorporate into the existing collective algorithms in MPICH2. As a case study, a multi-core-aware broadcast algorithm has been implemented and evaluated. The performance evaluation shows that this optimization methodology for collective operations on multi-core clusters is effective.
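The hierarchical-virtual-topology idea behind the multi-core-aware broadcast can be illustrated with a small sketch. The following Python function is not the paper's MPICH2 implementation; it is a hypothetical schedule generator, assuming the common two-phase scheme for such broadcasts: the lowest rank on each node (or the root itself, on the root's node) acts as a leader, leaders exchange the message over the network in a binomial tree, and each leader then fans the message out to its node-local peers through shared memory. All names (`broadcast_schedule`, the `"inter"`/`"intra"` phase tags) are illustrative.

```python
def broadcast_schedule(nprocs, cores_per_node, root=0):
    """Return a list of (sender, receiver, phase) steps for a
    hypothetical two-phase, multi-core-aware broadcast:
      phase "inter": binomial tree among node leaders (network),
      phase "intra": leader fan-out within each node (shared memory)."""
    # Group ranks by node, assuming a block mapping of ranks to nodes.
    nodes = {}
    for r in range(nprocs):
        nodes.setdefault(r // cores_per_node, []).append(r)

    # Elect leaders: the root leads its own node, lowest rank elsewhere.
    leaders = [root if root in ranks else ranks[0]
               for _, ranks in sorted(nodes.items())]

    # Phase 1: binomial tree over the leaders, rooted at the root's leader.
    # Each round, every leader that already holds the message forwards it
    # to one leader that does not, so the holder count doubles per round.
    order = sorted(leaders, key=lambda l: l != root)  # root's leader first
    steps, have, pending = [], [order[0]], order[1:]
    while pending:
        for s in list(have):
            if not pending:
                break
            r = pending.pop(0)
            steps.append((s, r, "inter"))
            have.append(r)

    # Phase 2: each leader delivers to its node-local peers; these hops
    # stay inside one node, where cache-aware shared-memory copies apply.
    for _, ranks in sorted(nodes.items()):
        lead = root if root in ranks else ranks[0]
        steps.extend((lead, r, "intra") for r in ranks if r != lead)
    return steps
```

For example, with 8 ranks, 4 cores per node, and root 0, the schedule has one inter-node step (0 → 4) followed by six intra-node fan-out steps, so only a single message crosses the network instead of up to seven in a topology-oblivious binomial broadcast.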