Data centric cache measurement using hardware and software instrumentation

  • Authors:
  • Bryan R. Buck; Jeffrey K. Hollingsworth

  • Venue:
  • Data centric cache measurement using hardware and software instrumentation
  • Year:
  • 2004

Abstract

The speed at which microprocessors can perform computations is increasing faster than the speed of access to main memory, making efficient use of memory caches ever more important. Because of this, information about the cache behavior of applications is valuable for performance tuning. To be most useful to a programmer, this information should be presented in a way that relates it to data structures at the source code level; we will refer to this as data centric cache information. This dissertation examines the problem of how to collect such information. We describe techniques for accomplishing this using hardware performance monitors and software instrumentation. We discuss both performance monitoring features that are present in existing processors and a proposed feature for future designs.

The first technique we describe samples cache miss addresses and relates them to data structures. We present the results of experiments using an implementation of this technique inside a simulator, which show that it can collect the desired information accurately and with low overhead. We then discuss a tool called Cache Scope that implements this technique on actual hardware, the Intel Itanium 2 processor. Experiments with this tool validate that perturbation and overhead can be kept low in a real-world setting. We present examples of tuning the performance of two applications based on data from this tool. By changing only the layout of data structures, we achieved reductions in running time of approximately 24% and 19%.

We also describe a technique that samples eviction addresses, using a proposed hardware feature that provides information about cache evictions. We present results from an implementation of this technique inside a simulator, showing that even though this requires storing considerably more data than sampling cache misses, we are still able to collect information accurate enough to be useful while keeping overhead low. We discuss an example of performance tuning in which we were able to reduce the running time of an application by 8% using information gained from this tool.
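
As a rough illustration of the first technique (sampling cache miss addresses and relating them to data structures), the attribution step might be sketched as follows. This is not the dissertation's implementation or Cache Scope's code: the function name `attribute_samples`, the `(name, base, size)` object table, and the `sample_interval` scaling are all assumptions made for this example, and the sampled addresses are made-up values standing in for what a hardware performance monitor would report.

```python
from bisect import bisect_right
from collections import Counter


def attribute_samples(objects, sampled_addrs, sample_interval=1):
    """Attribute sampled cache-miss addresses to named data structures.

    objects: list of (name, base_address, size_in_bytes) tuples, assumed
             non-overlapping (a hypothetical stand-in for symbol and heap info).
    sampled_addrs: iterable of miss addresses, one per sample.
    sample_interval: assumed number of misses represented by each sample.
    Returns a Counter mapping object name to estimated miss count.
    """
    # Sort objects by base address so each sample can be located by binary search.
    ranges = sorted(objects, key=lambda obj: obj[1])
    bases = [base for _, base, _ in ranges]

    counts = Counter()
    for addr in sampled_addrs:
        # Index of the last object whose base address is <= addr.
        i = bisect_right(bases, addr) - 1
        if i >= 0:
            name, base, size = ranges[i]
            if addr < base + size:
                counts[name] += sample_interval
                continue
        # The address falls outside every known object.
        counts["<unknown>"] += sample_interval
    return counts


if __name__ == "__main__":
    # Hypothetical object table and samples, for illustration only.
    objects = [("grid", 0x1000, 0x400), ("particles", 0x2000, 0x800)]
    samples = [0x1000, 0x13FF, 0x1400, 0x2004, 0x27FF, 0x0FFF]
    for name, misses in attribute_samples(objects, samples, 100).most_common():
        print(f"{name}: ~{misses} misses")
```

Each sample is scaled by the sampling interval to estimate total misses per data structure, which is what lets a sampling-based approach keep overhead low: only a small fraction of misses needs to be recorded and looked up.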