Compilers: principles, techniques, and tools
Achieving Full Parallelism Using Multidimensional Retiming
IEEE Transactions on Parallel and Distributed Systems
Fusion of Loops for Parallelism and Locality
IEEE Transactions on Parallel and Distributed Systems
Optimal weighted loop fusion for parallel programs
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Data transformations for eliminating conflict misses
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Scheduling of uniform multidimensional systems under resource constraints
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A tile selection algorithm for data locality and cache interference
ICS '99 Proceedings of the 13th international conference on Supercomputing
Optimal two level partitioning and loop scheduling for hiding memory latency for DSP applications
Proceedings of the 37th Annual Design Automation Conference
On Uniformization of Affine Dependence Algorithms
IEEE Transactions on Computers
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Loop Scheduling and Partitions for Hiding Memory Latencies
Proceedings of the 12th international symposium on System synthesis
Compiler Transformations for High-Performance Computing
An analytical model for loop tiling and its solution
ISPASS '00 Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software
Register aware scheduling for distributed cache clustered architecture
ASP-DAC '03 Proceedings of the 2003 Asia and South Pacific Design Automation Conference
Performance advantage of reconfigurable cache design on multicore processor systems
International Journal of Parallel Programming
Comprehensive cache performance tuning with a toolset
Future Generation Computer Systems
With the widening performance gap between processors and main memory, efficient memory access behavior is necessary for good program performance. Loop partitioning is an effective way to exploit data locality. Traditional loop partitioning techniques, however, consider only a single nested loop. This paper presents a multiple-loop partition scheduling technique that combines loop partitioning and data padding to generate a detailed partition schedule. Computation and data prefetching are balanced in the partition schedule so that long memory latency can be hidden efficiently. Multiple-loop partition scheduling explores parallelism among computations and exploits data locality both between different loop nests and within each loop nest. Data padding is applied in our technique to eliminate cache interference, which overcomes the problem of cache conflict misses arising from loop partitioning. Therefore, our technique can be applied in architectures with low-associativity caches. The experiments show that multiple-loop partition scheduling achieves significant improvement over existing methods.