Using Hardware Counters to Automatically Improve Memory Performance

Authors:
Mustafa M. Tikir;Jeffrey K. Hollingsworth
Affiliations:
University of Maryland at College Park;University of Maryland at College Park
Venue:
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Year:
2004

Citing 15
Cited 20

NUMA policies and their relation to memory architecture

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The robustness of NUMA memory management

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Scheduling and page migration for multiprocessor compute servers

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Operating system support for improving data locality on CC-NUMA compute servers

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
Performance experiences on Sun's Wildfire prototype

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Using hardware performance monitors to isolate memory bottlenecks

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Architecture and design of AlphaServer GS320

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
The sun fireplane system interconnect

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Dynamic page placement to improve locality in CC-NUMA multiprocessors for TPC-C

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling

IEEE Transactions on Parallel and Distributed Systems
SMP system interconnect instrumentation for performance analysis

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
System Overview of the SGI Origin 200/2OOO Product Line

COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
WildFire: A Scalable Path for SMPs

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
An API for Runtime Code Patching

International Journal of High Performance Computing Applications

NUMA-Aware Java Heaps for Server Applications

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
A hybrid hardware/software approach to efficiently determine cache coherence Bottlenecks

Proceedings of the 19th annual international conference on Supercomputing
affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system

Proceedings of the 19th annual international conference on Supercomputing
Hardware profile-guided automatic page placement for ccNUMA systems

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Analysis of cache-coherence bottlenecks with hybrid hardware/software techniques

ACM Transactions on Architecture and Code Optimization (TACO)
Performance metrics and ontologies for Grid workflows

Future Generation Computer Systems
Hardware monitors for dynamic page migration

Journal of Parallel and Distributed Computing
Integrating Dynamic Memory Placement with Adaptive Load-Balancing for Parallel Codes on NUMA Multiprocessors

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Enhancing operating system support for multicore processors by using hardware performance monitoring

ACM SIGOPS Operating Systems Review
Dynamic data migration for structured AMR solvers

International Journal of Parallel Programming
NUMA-aware memory manager with dominant-thread-based copying GC

Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Feedback-directed page placement for ccNUMA via hardware-generated memory traces

Journal of Parallel and Distributed Computing
Geographical locality and dynamic data migration for OpenMP implementations of adaptive PDE solvers

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
Understanding stencil code performance on multicore architectures

Proceedings of the 8th ACM International Conference on Computing Frontiers
Exploiting thread-data affinity in OpenMP with data access patterns

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Exploring thread and memory placement on NUMA architectures: solaris and linux, UltraSPARC/FirePlane and opteron/hypertransport

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Overseer: low-level hardware monitoring and management for Java

Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
Matching memory access patterns and data placement for NUMA systems

Proceedings of the Tenth International Symposium on Code Generation and Optimization
A flexible and dynamic page migration infrastructure based on hardware counters

The Journal of Supercomputing
Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling

Future Generation Computer Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we introduce a profile-driven online page migration scheme and investigate its impact on the performance of multithreaded applications. We use lightweight, inexpensive plug-in hardware counters to profile the memory access behavior of an application, and then migrate pages to memory local to the most frequently accessing processor. Using the Dyninst runtime instrumentation combined with hardware counters, we were able to add page migration capabilities to the system without having to modify the operating system kernel, or to re-compile application programs. This approach reduced the total number of non-local memory accesses of applications by up to 90%. Even on a system with small remote to local memory access latency rations, this resulted in up to 16% improvement in execution time.