Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP

Authors:
Yong Yan; Xiaodong Zhang; Zhao Zhang
Affiliations:
Hewlett Packard Labs, Palo Alto, CA; College of William and Mary, Williamsburg, VA; College of William and Mary, Williamsburg, VA
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2000

Abstract

Exploiting the cache locality of parallel programs at runtime complements compiler optimization, and it is particularly important for applications with dynamic memory-access patterns. We propose a memory-layout-oriented technique that exploits the cache locality of parallel loops at runtime on symmetric multiprocessor (SMP) systems. Guided by application-dependent and target-architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop according to the memory-access space of its execution. Through effective runtime transformations, the system maximizes data reuse within each partitioned data region assigned to a cache and minimizes data sharing among the partitioned data regions assigned to different caches. The tasks in the partitions are scheduled in an adaptive, locality-preserving way that minimizes program execution time by trading off load balance against locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and on a SimOS-simulated SMP. Our simulation and measurement results show that, for programs with regular computation and memory-access patterns, whose load balance and cache locality can be optimized well by tiling and other program transformations, our runtime approach achieves performance comparable to that of compiler optimizations. For applications with irregular computation and dynamic memory-access patterns, which are usually hard to optimize with static compiler techniques, our experimental results show that the approach significantly improves memory performance.
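
The partition-then-schedule idea described in the abstract can be illustrated with a toy sketch. This is not Cacheminer's implementation or API: the function names, the `address_of` callback, the `region_bytes` parameter, and the "idle processor takes a chunk from the most loaded queue" policy are all assumptions made for illustration. The sketch groups loop iterations by the memory region they touch, gives each processor a contiguous run of regions, and then simulates an execution in which a processor that runs out of local work takes a chunk from another queue.

```python
from collections import defaultdict, deque


def partition_by_region(iterations, address_of, region_bytes, num_procs):
    """Group iterations by the memory region they access (hypothetical helper).

    address_of(i) returns the byte address touched by iteration i; iterations
    that fall into the same region_bytes-sized region form one chunk.
    Each processor receives a contiguous run of regions.
    """
    by_region = defaultdict(list)
    for it in iterations:
        by_region[address_of(it) // region_bytes].append(it)

    regions = sorted(by_region)
    queues = [deque() for _ in range(num_procs)]
    per_proc = -(-len(regions) // num_procs)  # ceiling division
    for idx, region in enumerate(regions):
        queues[idx // per_proc].append(by_region[region])
    return queues


def run(queues, cost):
    """Simulate execution of the partitioned loop (hypothetical policy).

    Each processor drains its own queue from the front. A processor with an
    empty queue takes one chunk from the back of the longest remaining queue,
    trading some locality for load balance.
    Returns the chunks each processor executed and each processor's finish time.
    """
    num_procs = len(queues)
    clock = [0] * num_procs
    executed = [[] for _ in range(num_procs)]
    while any(queues):
        # The processor with the smallest clock is the next one to become free.
        p = min(range(num_procs), key=lambda i: clock[i])
        if queues[p]:
            chunk = queues[p].popleft()
        else:
            victim = max(range(num_procs), key=lambda i: len(queues[i]))
            chunk = queues[victim].pop()
        clock[p] += sum(cost(it) for it in chunk)
        executed[p].append(chunk)
    return executed, clock
```

With uniform iteration costs every processor stays within its own regions. When one processor's chunks are much more expensive, the other processor finishes its local work first and picks up the remaining chunk, which is the load-balance versus locality trade-off the abstract refers to.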