Advanced compiler optimizations for supercomputers
Communications of the ACM - Special issue on parallelism
Interprocedural dependence analysis and parallelization
SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
A global approach to detection of parallelism
Multilevel cache hierarchies: organizations, protocols, and performance
Journal of Parallel and Distributed Computing
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Dynamic Processor Self-Scheduling for General Parallel Nested Loops
IEEE Transactions on Computers
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Formalized methodology for data reuse exploration in hierarchical memory mappings
ISLPED '97 Proceedings of the 1997 international symposium on Low power electronics and design
An Efficient Solution to the Cache Thrashing Problem Caused by True Data Sharing
IEEE Transactions on Computers
A Software Approach to Avoiding Spatial Cache Collisions in Parallel Processor Systems
IEEE Transactions on Parallel and Distributed Systems
Power and Speed-Efficient Code Transformation of Video Compression Algorithms for RISC Processors
Journal of VLSI Signal Processing Systems - Special issue on multimedia signal processing
Formal model of data reuse analysis for hierarchical memory organizations
Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design
Parallel processing systems with caches or local memories in their memory hierarchies are considered. Such systems attach a cache or local memory to each processor and usually rely on a write-invalidate protocol for cache coherence. In these systems, a problem called 'cache (or local memory) thrashing' can arise during the execution of parallel programs: data moves back and forth unnecessarily between the caches or local memories of different processors. An approach to eliminating, or at least reducing, such movement for nested parallel loops is presented. It is based on the relations between array element accesses and the indexes of the enclosing loops. These relations can be used to assign processors to the appropriate iterations of parallel loops in a loop nest, so that each iteration works on data already resident in its processor's cache or local memory. An algorithm is presented that computes, from the loop indexes of the iterations a processor has already executed, which iteration of the parallel loop that processor should execute next. The method benefits parallel code with nested loop structures in a wide range of applications; experimental results show speedups of up to 2.