A Practical Approach for Performance Analysis of Shared-Memory Programs

  • Authors:
  • Bogdan Marius Tudor; Yong Meng Teo


  • Venue:
  • IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
  • Year:
  • 2011

Abstract

Parallel programming has spread beyond HPC into the mainstream, enabled by a growing number of programming models, languages, and methodologies, as well as the availability of multicore systems. However, performance analysis of parallel programs remains difficult, especially for large and complex programs or for applications developed using different programming models. This paper proposes a simple analytical model for studying the speedup of shared-memory programs on multicore systems. The proposed model derives the speedup and the speedup loss due to data dependency and memory overhead for various configurations of threads, cores, and memory access policies on UMA and NUMA systems. The model is practical because it uses only generally available, non-intrusive inputs derived from traces of the operating-system run queue and from hardware event counters. Using six OpenMP HPC dwarfs from the NPB benchmark suite, the model's predictions differ from measured results by 9% on average for UMA and 11% for NUMA. Our analysis shows that speedup loss is dominated by memory contention, especially for larger problem sizes. For the worst-performing structured-grid dwarf on UMA, memory contention accounts for up to 99% of the speedup loss. Based on this insight, we apply the model to determine the optimal number of cores that alleviates memory contention, maximizing speedup and reducing execution time.
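
To make the kind of decomposition described in the abstract concrete, the sketch below shows a minimal, Amdahl-style speedup estimate that separates a data-dependency (serial) term from a memory-contention term that grows with core count. This is an illustration under assumed inputs, not the paper's actual model; the function and parameter names (estimated_speedup, serial_fraction, mem_stall_fraction, contention_factor) are hypothetical stand-ins for quantities one might derive from run-queue traces and hardware event counters.

```python
# Illustrative sketch only: an Amdahl-style speedup estimate extended with a
# simple memory-contention term. All names and the contention growth law are
# assumptions for illustration, not the model proposed in the paper.

def estimated_speedup(n_cores: int,
                      serial_fraction: float,
                      mem_stall_fraction: float,
                      contention_factor: float) -> float:
    """Estimate speedup on n_cores from trace/counter-derived time fractions.

    serial_fraction    -- fraction of time lost to data dependency (e.g. from
                          an OS run-queue trace showing idle cores)
    mem_stall_fraction -- fraction of time stalled on memory at 1 core (e.g.
                          from hardware event counters)
    contention_factor  -- assumed per-core growth of memory stalls under
                          contention (purely hypothetical)
    """
    compute_fraction = 1.0 - serial_fraction - mem_stall_fraction
    # Compute time shrinks with more cores; memory stalls grow with contention.
    parallel_time = (serial_fraction
                     + compute_fraction / n_cores
                     + mem_stall_fraction * (1.0 + contention_factor * (n_cores - 1)))
    return 1.0 / parallel_time


if __name__ == "__main__":
    # Example use, in the spirit of the paper's core-count analysis: sweep the
    # core count and pick the one that maximizes the predicted speedup.
    params = dict(serial_fraction=0.05, mem_stall_fraction=0.20, contention_factor=0.10)
    best = max(range(1, 33), key=lambda n: estimated_speedup(n, **params))
    print(f"best core count: {best}, predicted speedup: "
          f"{estimated_speedup(best, **params):.2f}")
```

Sweeping the core count over such an estimate illustrates the trade-off the paper exploits: beyond some point, added memory contention outweighs the benefit of additional parallelism, so the speedup-maximizing core count can be smaller than the number of available cores.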