A Partitioning Strategy for Nonuniform Problems on Multiprocessors
IEEE Transactions on Computers
Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Principles of runtime support for parallel processors
ICS '88 Proceedings of the 2nd international conference on Supercomputing
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Characterizing the behavior of sparse algorithms on caches
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
ICS '93 Proceedings of the 7th international conference on Supercomputing
Compiler-directed data prefetching in multiprocessors with memory hierarchies
ICS '90 Proceedings of the 4th international conference on Supercomputing
Compiling Parallel Loops for High Performance Computers: Partitioning, Data Assignment, and Remapping
Parallelizing Loops with Indirect Array References of Pointers
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Efficient support for irregular applications on distributed-memory machines
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Run-Time Reference Clustering for Cache Performance Optimization
PAS '97 Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms / Architecture Synthesis
Hi-index | 0.00 |
Applications with irregular data structures such as sparse matrices or finite element meshes account for a large fraction of engineering and scientific applications. Domain decomposition techniques are commonly used to partition these applications to reduce interprocessor communication on message passing parallel systems. Our work investigates the use of domain decomposition techniques on cache-coherent parallel systems.Many good domain decomposition algorithms are now available. We show that further application improvements are attainable using data and program restructuring in conjunction with domain decomposition. We give techniques for data layout to reduce communication, blocking with subdomains to improve uniprocessor cache behavior, and insertion of prefetches to hide the latency of interprocessor communication.This paper details our restructuring techniques and provides experimental results on the KSR1 multiprocessor for a sparse matrix application. The experimental results include counts of cache misses provided by the KSR PMON performance monitoring tool. Our data show that cache coherency traffic can be reduced by 30%–60% using our data layout scheme and that more than 53% of the remaining coherency cache misses can be eliminated using prefetch instructions.