Performance Impact of Process Mapping on Small-Scale SMP Clusters - A Case Study Using High Performance Linpack

  • Authors:
  • Tau Leng, Rizwan Ali, Jenwei Hsieh, Victor Mashayekhi, Reza Rooholamini

  • Venue:
  • IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
  • Year:
  • 2002


Abstract

Typically, a High Performance Computing (HPC) cluster loosely couples multiple Symmetric Multi-Processor (SMP) platforms into a single processing complex. Each SMP uses shared memory for communication among its own processors, whereas communication across SMPs goes through the intra-cluster interconnect. By analyzing the communication pattern of the processes, it is possible to arrive at a mapping of processes to processors that ensures optimal communication paths for the critical traffic. This critical traffic refers to the communication pattern of the program, which can be characterized by the frequency or the size (or both) of the messages. To find an ideal mapping, it is imperative to understand the communication characteristics of the SMP memory system, the intra-cluster interconnect, and the Message Passing Interface (MPI) program running on the cluster.

Our approach is to study the ideal mapping for two classes of interconnects: 1) standard, high-volume Ethernet interconnects, and 2) proprietary, low-latency, high-bandwidth interconnects. In the first installment of our work, presented in this paper, we focus on the ideal mapping for the first class.

We configured a 16-node dual-processor cluster interconnected with Fast Ethernet, Gigabit Ethernet, Giganet, and Myrinet. We used the High Performance Linpack (HPL) benchmark to demonstrate that re-mapping processes to processors (or changing the order in which processors are used) can affect overall performance. The mappings are based on an analysis of the HPL program obtained by running an MPI profiling tool. Our results suggest that the performance of HPL with Fast Ethernet as the interconnect can be improved by 10% to 50%, depending on the process mapping and the problem size. Conversely, an ad hoc mapping can adversely affect cluster performance.
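
The re-mapping described above amounts to changing the order in which MPI ranks are assigned to processors, so that ranks exchanging the critical traffic share an SMP node (and thus communicate through shared memory) instead of crossing the interconnect. The sketch below is not taken from the paper; it is a minimal MPI program (the filename map_check.c is illustrative) that one might use to verify which host each rank actually landed on after such a remapping.

    /* map_check.c - report the rank-to-host mapping of an MPI job.
     * Illustrative sketch only; not part of the paper's HPL experiments. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, namelen;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &namelen);

        /* Each rank reports where it is running. Which ranks share a node
         * (and therefore communicate via shared memory rather than the
         * cluster interconnect) is determined by the placement order
         * chosen at launch time. */
        printf("rank %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }

With MPICH-style launchers, that placement is typically controlled by the order of host entries in the machinefile: listing each dual-processor node on two consecutive lines places consecutive ranks on the same node (block placement), while alternating node names spreads consecutive ranks across nodes (round-robin). Reordering of this kind is what the paper evaluates with HPL.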