Detailed cache coherence characterization for OpenMP benchmarks

  • Authors:
  • Jaydeep Marathe;Anita Nagarajan;Frank Mueller

  • Affiliations:
  • North Carolina State University, Raleigh, NC;Intel Technology India Pvt. Ltd., Bangalore, India;North Carolina State University, Raleigh, NC

  • Venue:
  • Proceedings of the 18th annual international conference on Supercomputing
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Past work on studying cache coherence in shared-memory symmetric multiprocessors (SMPs) concentrates on studying aggregate events, often from an architecture point of view. However, this approach provides insufficient information about the exact sources of inefficiencies in parallel applications. For SMPs in contemporary clusters, application performance is impacted by the pattern of shared memory usage, and it becomes essential to understand coherence behavior in terms of the application program constructs -- such as data structures and source code lines.The technical contributions of this work are as follows. We introduce ccSIM, a cache-coherent memory simulator fed by data traces obtained through on-the-fly dynamic binary rewriting of OpenMP benchmarks executing on a Power3 SMP node. We explore the degrees of freedom in interleaving data traces from the different processors and assess the simulation accuracy by comparing with hardware performance counters. The novelty of ccSIM lies in its ability to relate coherence traffic -- specifically coherence misses as well as their progenitor invalidations -- to data structures and to their reference locations in the source program, thereby facilitating the detection of inefficiencies. Our experiments demonstrate that (a) cache coherence traffic is simulated accurately for SPMD programming styles as its invalidation traffic closely matches the corresponding hardware performance counters, (b) we derive detailed coherence information indicating the location of invalidations in the application code, i.e, source line and data structures and (c) we illustrate opportunities for optimizations from these details. By exploiting these unique features of ccSIM, we were able to identify and locate opportunities for program transformations, including interactions with OpenMP constructs, resulting in both significantly decreased coherence misses and savings of up to 73% in wall-clock execution time for several real-world benchmarks.